Non-linear latent variable models have become increasingly popular in a variety of applications. 
However, there has been little study on theoretical properties of these models. In this article, 
we study rates of posterior contraction in univariate density estimation for a class of non-linear 
latent variable models where unobserved U(0, 1) latent variables are related to the response 
variables via a random non-linear regression with an additive error. Our approach relies on 
characterizing the space of densities induced by the above model as kernel convolutions with 
a general class of continuous mixing measures. The literature on posterior rates of contraction 
in density estimation almost entirely focuses on finite or countably infinite mixture models. We 
develop approximation results for our class of continuous mixing measures. Using an appropriate 
Gaussian process prior on the unknown regression function, we obtain the optimal frequentist 
rate up to a logarithmic factor under standard regularity conditions on the true density. 
1. Introduction 

Kernel mixture models are known to be extremely flexible and have been extensively 
used for density estimation. Starting with a parametric kernel $\mathcal{K}(y, \theta)$, one can obtain a 
class of densities $f_G$ as 

$f_G(y) = \int \mathcal{K}(y, \theta)\, dG(\theta)$, (1.1) 

where $G(\cdot)$ is a mixing distribution. In particular, by choosing $G$ to be a discrete 
distribution with finitely many atoms $\theta_h$, $h = 1, \ldots, k$, having weights $\pi_h$, $h = 1, \ldots, k$, with 
$\sum_{h=1}^{k} \pi_h = 1$, one obtains the important class of finite mixture models. In a Bayesian 
framework, one can induce a prior distribution on the class of densities by assigning a 
prior to $G$, which amounts to specifying priors on $k$ and $\{\theta_h, \pi_h\}$, $h = 1, \ldots, k$, in case of 
finite mixture models. A Dirichlet process (Ferguson, 1973, 1974) is often used as a de- 
fault prior on the class of mixing distributions due to its attractive theoretical properties 
and availability of efficient algorithms for posterior computation. Since realizations of a 
Dirichlet process are almost surely discrete (see Sethuraman (1994) for a constructive 
definition), a Dirichlet process prior on $G$ induces an infinite discrete mixture model for 
$f_G$. A well known drawback of finite mixture models is the sensitivity of the results to 
the choice of k, whereas updating k in a fully Bayesian formulation is computationally 
intensive. The infinite mixture representation avoids fixing a truncation level and sophis- 
ticated sampling algorithms such as Walker (2007) enable posterior sampling from the 
full posterior distribution. 

Although finite and infinite discrete mixture models have been extensively used, there 
are reasons to look beyond these classes of models. A discrete prior on G partitions the 
n subjects into one or more clusters, with subjects in the same cluster sharing the same 
$\theta$ value. Although this property has been widely exploited for probabilistic clustering, 
one might want to avoid the clustering phenomenon in situations where the interest is 
purely in density estimation and one is not interested in interpreting the clusters or 
in inferring the cluster specific parameters. It is often the case that the clusters don't 
have any physical significance and subjects get inappropriately grouped together for all 
parameter values obscuring subtle differences. In such cases, the clustering is more of 
an artifact of the model and a continuum among the parameter values for the subjects 
seems more reasonable. 

While Polya tree priors (Ferguson, 1974; Mauldin, Sudderth and Williams, 1992) can 
be directly used to induce priors on the space of absolutely continuous densities (Lavine, 
1992), the resulting density estimates are found to be spiky in practice. Lenk (1988, 
1991) proposed a logistic Gaussian process which bypasses the mixture formulation by 
directly modeling an unknown density on the unit interval as the exponent of a random 
function re-normalized, or equivalently modeling the log-density using a Gaussian process 
prior. The normalizing constant in the logistic Gaussian process models is analytically 
intractable and causes difficulties in posterior sampling. Refer to Tokdar (2007) for a 
faster implementation in density estimation with logistic Gaussian process priors. 

Recently, Kundu and Dunson (2011) proposed an approach for univariate density es- 
timation in which the response variables are modeled as unknown functions of uniformly 
distributed latent variables with an additive Gaussian error. The latent variable specifica- 
tion allows straightforward posterior computation via conjugate posterior updates. Since 
inverse c.d.f. transforms of uniform random variables can generate draws from any dis- 
tribution, by choosing the prior on the error variance to assign positive mass to arbitrary 
neighborhoods of zero while placing a prior with large support on the space of functions 
mapping the latent variables to the observed variables (referred to as the transfer func- 
tion from now on), their prior can approximate draws from any continuous distribution 
function arbitrarily closely. One can also conveniently center the non-parametric model 
on a parametric family by centering the prior on the transfer function on a parametric 
class of quantile (or inverse c.d.f.) functions $\{F_\theta^{-1} : \theta \in \Theta\}$. While such centering on 
parametric guesses can be achieved in Dirichlet process mixture models by appropriate 
choice of the base measure $G_0$, posterior computation becomes complicated unless the 
base measure is conjugate to the kernel $\mathcal{K}$. 

There has been growing interest in studying asymptotic properties of Bayesian proce- 
dures assuming the data are sampled from a fixed unknown distribution. The posterior 
distribution is said to be strongly consistent if it concentrates almost surely in arbi- 
trarily small $L_1$ neighborhoods of the true distribution with increasing sample size. 
Ghosal, Ghosh and Ramamoorthi (1999) provided general conditions in terms of $L_1$ 
metric entropy to ensure strong posterior consistency and verified those conditions for 
Dirichlet process location mixtures of normal kernels under certain regularity conditions. 
Tokdar (2006) extended their result to the location-scale mixture case while encom- 
passing a significantly larger class of "true" densities. Ghosal, Ghosh and van der Vaart 
(2000) considered the rate of contraction of a posterior distribution to the true density, 
providing an upper bound on the rate at which one can let the neighborhood size de- 
crease to zero. Ghosal and van der Vaart (2001) obtained rates of posterior contraction 
for the Dirichlet process mixture model when the true density is a location-scale mixture 
of normals with component specific standard deviations bounded between two positive 
numbers. Although a nearly parametric rate is obtained in this case, the above class 
of densities is restrictive since one needs the component specific standard deviations to 
be arbitrarily small for normal mixtures to be able to approximate any smooth density. 
Ghosal and van der Vaart (2007) developed a generalization of the basic rate theorem 
in Ghosal, Ghosh and van der Vaart (2000) and addressed a broader class of densities, 
namely, the class of twice continuously differentiable densities. Under some regularity 
conditions which include the requirement that the true density be compactly supported, 
they obtained the optimal minimax rate of $n^{-2/5}$ up to a logarithmic factor based on 
Dirichlet process mixture models. Kruijer, Rousseau and van der Vaart (2010) considered 
finite location-scale mixtures of exponential power distributions and obtained minimax 
rates of convergence up to a logarithmic factor for any $\beta$-Hölder density, implying 
rate adaptivity to any degree of smoothness of the true density. 

In this article, we study rates of posterior contraction in univariate density estimation 
for a class of non-linear latent variable models (NL-LVM) similar to Kundu and Dunson 
(2011). The NL-LVM encompasses a large class of univariate densities and it is straight- 
forward to extend the class for multivariate density estimation and density regression 
problems. In particular, the NL-LVM has elements in common to Gaussian process 
latent variable models (GP-LVM) routinely used in machine learning applications for 
high-dimensional data visualization and dimensionality reduction (Lawrence, 2004, 2005; 
Lawrence and Moore, 2007; Ferris, Fox and Lawrence, 2007). However, the literature on 
GP-LVM doesn't provide any discussion on the flexibility of their specification in terms 
of the induced density of the observations after marginalizing out the latent variables. 
Although Kundu and Dunson (2011) provide an intuitive argument for large support in 
the density space for the univariate case, a rigorous characterization of the prior sup- 
port is missing. We provide an accurate characterization of the prior support in terms 
of kernel convolution with a class of continuous mixing measures. We provide conditions 
for the mixing measure to admit a density with respect to Lebesgue measure and show 
that the prior support of the NL-LVM is at least as large as that of DP mixture models. 
We then develop approximation results for the above class of continuous mixing mea- 
sures and subsequently derive posterior contraction rates assuming standard smoothness 
assumptions on the true density. Assuming the true density to be twice continuously 
differentiable, the best obtainable rate is found to be the minimax rate of $n^{-2/5}$ up to a 
logarithmic factor. Further, if the prior on the transfer function is centered on a parametric 
family which happens to contain the true density, then one gets a faster convergence 
rate which can be arbitrarily close to the parametric rate of $n^{-1/2}$ up to a logarithmic 
factor. Also, analogous to the Dirichlet process mixture models, when the true density is 
a Gaussian convolution with a finite mixture of truncated Gaussians, one can also attain 
a near parametric convergence rate. 

The main contributions of this article are as follows. (i) The characterization of our 
model using convolutions implies that one can approximate any continuous density by 
choosing the transfer function to be the quantile function of the true density and letting 
the error variance decrease to zero. When the true density is not compactly supported, 
the corresponding quantile function is unbounded with discontinuities at 0 and 1 and 
it is not immediate whether a prior for the transfer function supported on C[0, 1] (a 
default choice being a Gaussian process prior) results in the optimal rate. To address 
this issue, we define a sequence of C[0, 1] functions that converge pointwise to the true 
quantile function and derive concentration bounds for the prior around this sequence. 
(ii) The traditional approach of approximating the Gaussian convolution of a compactly 
supported density by discrete normal mixtures isn't well-suited for our purpose since the 
quantile function of the mixing distribution is a step function which doesn't belong to the 
sup-norm support of any smooth stochastic process. We develop a technique based on 
maximum entropy moment matching (Mead and Papanicolaou, 1984) for approximating 
a compactly supported density by an infinitely smooth density. Although the above 
developments are crucially used for our treatment of the non-compact case, we believe 
these results will be of independent interest. 

The rest of the article is organized as follows. We introduce relevant notations and 
terminologies in Section 2. To make the article self-contained, we also provide a brief 
background on Gaussian process priors. In Section 3, we formulate our assumptions on 
the true density $f_0$ and in the following section, we describe the NL-LVM model and relate 
it to convolutions. We state our main theorem on convergence rates for the compact case 
in Section 5 and the non-compact case in Section 6. We discuss some special cases in 
Section 7. Section 8 discusses some implications of our results and outlines possible future 
directions. 

2. Notations 

Throughout the article, $Y_1, \ldots, Y_n, \ldots$ are independent and identically distributed with 
density $f_0 \in \mathcal{F}$, the set of all densities on $\mathbb{R}$ absolutely continuous with respect to 
the Lebesgue measure $\lambda$. The supremum and $L_1$-norm are denoted by $\|\cdot\|_\infty$ and $\|\cdot\|_1$, 
respectively. We let $\|\cdot\|_{p,\nu}$ denote the norm of $L_p(\nu)$, the space of measurable functions 
with $\nu$-integrable $p$th absolute power. For two density functions $f, g \in \mathcal{F}$, let $h$ 
denote the Hellinger distance defined as $h^2(f, g) = \|\sqrt{f} - \sqrt{g}\|_{2,\lambda}^2 = \int (f^{1/2} - g^{1/2})^2\, d\lambda$, 
$K(f, g)$ the Kullback-Leibler divergence given by $K(f, g) = \int \log(f/g) f\, d\lambda$ and $V(f, g) = 
\int \log(f/g)^2 f\, d\lambda$. The notation C[0, 1] is used for the space of continuous functions $f : 
[0, 1] \to \mathbb{R}$ endowed with the supremum norm. For $\beta > 0$, we let $C^\beta[0, 1]$ denote the 
Hölder space of order $\beta$, consisting of the functions $f \in C[0, 1]$ that have $\lfloor\beta\rfloor$ continuous 
derivatives with the $\lfloor\beta\rfloor$th derivative $f^{(\lfloor\beta\rfloor)}$ being Lipschitz continuous of order $\beta - \lfloor\beta\rfloor$. 
The $\epsilon$-covering number $N(\epsilon, S, d)$ of a semi-metric space $S$ relative to the semi-metric 
$d$ is the minimal number of balls of radius $\epsilon$ needed to cover $S$. The logarithm of the 
covering number is referred to as the entropy. By near-optimal rate of convergence we 
mean optimal rate of convergence slowed down by a logarithmic factor. 

We write $\lesssim$ for inequality up to a constant multiple. Let 
$\phi(x) = (2\pi)^{-1/2} \exp(-x^2/2)$ denote the standard normal density, and let $\phi_\sigma(x) = 
(1/\sigma)\phi(x/\sigma)$. Let an asterisk denote a convolution, e.g., $(\phi_\sigma * f)(y) = \int \phi_\sigma(y - x) f(x)\,dx$. 
The support of a density $f$ is denoted by $\mathrm{supp}(f)$. 

We briefly recall the definition of the RKHS of a Gaussian process prior; a detailed 
review can be found in van der Vaart and van Zanten (2008b). A Borel measurable random 
element $W$ with values in a separable Banach space $(\mathbb{B}, \|\cdot\|)$ (e.g., C[0, 1]) is called 
Gaussian if the random variable $b^* W$ is normally distributed for any element $b^* \in \mathbb{B}^*$, 
the dual space of $\mathbb{B}$. The reproducing kernel Hilbert space (RKHS) $\mathbb{H}$ attached to a zero-mean 
Gaussian process $W$ is defined as the completion of the linear space of functions 
$t \mapsto E W(t) H$ relative to the inner product 

$\langle E W(\cdot) H_1, E W(\cdot) H_2 \rangle_{\mathbb{H}} = E H_1 H_2$, 

where $H$, $H_1$ and $H_2$ are finite linear combinations of the form $\sum_i a_i W(s_i)$ with $a_i \in \mathbb{R}$ 
and $s_i$ in the index set of $W$. 



3. Assumptions on the true density 

It has been widely recognized that one needs certain smoothness assumptions and tail 
conditions on the true density $f_0$ to derive posterior convergence rates at $f_0$. We need 
the following assumptions in our case. 

Assumption 3.1. $f_0$ is twice continuously differentiable with $\int (f_0''/f_0)^2 f_0\, d\lambda < \infty$ and 
$\int (f_0'/f_0)^4 f_0\, d\lambda < \infty$. 

Remark 3.1. Letting $f_0(y) = C \exp\{-w_0(y)\}$ on $\mathrm{supp}(f_0)$ so that $w_0(y) = \log C - 
\log f_0(y)$, we can restate Assumption 3.1 as $w_0$ being twice continuously differentiable 
and 

$\int_{-\infty}^{\infty} \{w_0'(y)\}^4 \exp\{-w_0(y)\}\,dy < \infty, \quad \int_{-\infty}^{\infty} \{w_0''(y)\}^2 \exp\{-w_0(y)\}\,dy < \infty$. (3.1) 

Assumption 3.2. $f_0$ is bounded, nondecreasing on $(-\infty, a]$, bounded away from 0 on 
$[a, b]$ and non-increasing on $[b, \infty)$ for some $a < b$. 

Assumption 3.1 is the same as Assumption 1.2 of Ghosal and van der Vaart (2007) and 
ensures that $h(f_0, f_0 * \phi_\sigma) = O(\sigma^2)$ as $\sigma \to 0$; see Lemma 4 of Ghosal and van der Vaart 
(2007) for a proof. Assumption 3.2 is the same as the assumption in Lemma 6 of the same 
paper. This is sufficient to guarantee that for every $\delta > 0$, there exists a constant $C > 0$ 
such that $f_0 * \phi_\sigma \geq C f_0$ for every $\sigma < \delta$. While Assumption 3.1 only allows sufficiently 
smooth densities, Assumption 3.2 is only a mild requirement in the sense that most 
reasonable densities arising in practice should satisfy it. Moreover, if $f_0$ is nondecreasing 
on $(-\infty, a]$ and nonincreasing on $[b, \infty)$ for some $a < b$, $f_0$ is automatically bounded and 
bounded away from zero on $[a, b]$ provided it is continuous and nowhere zero on $[a, b]$. 

4. The NL-LVM model 

Consider the nonlinear latent variable model, 

$y_i = \mu(\eta_i) + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2), \quad (i = 1, \ldots, n)$ (4.1) 
$\mu \sim \Pi_\mu, \quad \sigma \sim \Pi_\sigma, \quad \eta_i \sim U(0, 1),$ (4.2) 

where the $\eta_i$'s are subject specific latent variables, $\mu \in C[0, 1]$ is a transfer function relating 
the latent variables to the observed variables and $\epsilon_i$ is an idiosyncratic error specific to 
subject $i$. The density of $y$ conditional on the transfer function $\mu$ and scale $\sigma$ is obtained 
on marginalizing out the latent variable as 

$f(y; \mu, \sigma) = f_{\mu,\sigma}(y) = \int_0^1 \phi_\sigma(y - \mu(x))\,dx$. (4.3) 

Define a map $g : C[0, 1] \times [0, \infty) \to \mathcal{F}$ with $g(\mu, \sigma) = f_{\mu,\sigma}$. One can induce a prior $\Pi$ 
on $\mathcal{F}$ via the mapping $g$ by placing independent priors $\Pi_\mu$ and $\Pi_\sigma$ on C[0, 1] and $[0, \infty)$ 
respectively, with $\Pi = (\Pi_\mu \otimes \Pi_\sigma) \circ g^{-1}$. Kundu and Dunson (2011) assumed a Gaussian 
process prior with squared exponential covariance kernel on $\mu$ and an inverse-gamma 
prior on $\sigma^2$. 
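
As a rough numerical illustration of (4.3), the sketch below approximates $f_{\mu,\sigma}$ by a midpoint rule over the latent variable and checks that the result integrates to approximately one. The transfer function `mu_example`, the grid sizes and all function names are illustrative assumptions and are not part of the model specification.

```python
import math


def normal_pdf(z, sigma):
    """Density of N(0, sigma^2) evaluated at z."""
    return math.exp(-0.5 * (z / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def nllvm_density(y, mu, sigma, n_grid=400):
    """Midpoint-rule approximation of f_{mu,sigma}(y) = int_0^1 phi_sigma(y - mu(x)) dx."""
    h = 1.0 / n_grid
    return h * sum(normal_pdf(y - mu((i + 0.5) * h), sigma) for i in range(n_grid))


def mu_example(x):
    """Hypothetical continuous transfer function on [0, 1] (an assumption for illustration)."""
    return math.sin(2.0 * math.pi * x) + 2.0 * x


def total_mass(mu, sigma, lo=-6.0, hi=8.0, n_y=700):
    """Midpoint-rule approximation of the integral of f_{mu,sigma} over [lo, hi]."""
    step = (hi - lo) / n_y
    return step * sum(nllvm_density(lo + (j + 0.5) * step, mu, sigma) for j in range(n_y))
```

Calling `total_mass(mu_example, 0.5)` should return a value close to one, since the chosen range covers the image of `mu_example` by many standard deviations.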

It is not immediately clear whether the class of densities $f_{\mu,\sigma}$ in the range of $g$ encompasses 
a large subset of the density space. We provide an intuition that relates the above 
class with convolutions and is crucially used later on. Let $f_0$ be a continuous density 
with cumulative distribution function $F_0(t) = \int_{-\infty}^{t} f_0(x)\,dx$. Assume $f_0$ to be non-zero 
almost everywhere within its support, so that $F_0 : \mathrm{supp}(f_0) \to [0, 1]$ is strictly monotone 
and hence has an inverse $F_0^{-1} : [0, 1] \to \mathrm{supp}(f_0)$ satisfying $F_0\{F_0^{-1}(t)\} = t$ for all 
$t \in [0, 1]$. If $\mathrm{supp}(f_0) = \mathbb{R}$, then the domain of $F_0^{-1}$ is the open interval (0, 1) instead 
of [0, 1]. 

Letting $\mu_0(x) = F_0^{-1}(x)$, one obtains 

$f_{\mu_0,\sigma}(y) = \int_0^1 \phi_\sigma(y - F_0^{-1}(x))\,dx = \int_{-\infty}^{\infty} \phi_\sigma(y - t) f_0(t)\,dt$, (4.4) 

where the second equality follows from the change of variable theorem. Thus, $f_{\mu_0,\sigma}(y) = 
(\phi_\sigma * f_0)(y)$, i.e., $f_{\mu_0,\sigma}$ is the convolution of $f_0$ with a normal density having mean 0 and 
standard deviation $\sigma$. It is well known that the convolution $\phi_\sigma * f_0$ can approximate $f_0$ 
arbitrarily closely as the bandwidth $\sigma \to 0$. More precisely, for $f_0 \in L_p(\lambda)$ for any $p \geq 1$, 
$\|\phi_\sigma * f_0 - f_0\|_{p,\lambda} \to 0$ as $\sigma \to 0$. Furthermore, a stronger result $\|\phi_\sigma * f_0 - f_0\|_\infty = O(\sigma^2)$ 
holds if $f_0$ is compactly supported. A similar result holds for the Hellinger metric, with 
the precise approximation error under Assumption 3.1 given by $h(\phi_\sigma * f_0, f_0) = O(\sigma^2)$ 
as $\sigma \to 0$. 
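
The change of variable behind (4.4) can be checked numerically. The sketch below uses a hypothetical compactly supported density $f_0(t) = 2t$ on [0, 1], chosen only because its quantile function $F_0^{-1}(x) = \sqrt{x}$ is available in closed form, and compares the latent-variable form with the direct convolution. The choice of density, the grid size and the function names are assumptions made for illustration.

```python
import math


def normal_pdf(z, sigma):
    """Density of N(0, sigma^2) evaluated at z."""
    return math.exp(-0.5 * (z / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def density_via_quantile(y, sigma, n_grid=20000):
    """int_0^1 phi_sigma(y - F0^{-1}(x)) dx with F0^{-1}(x) = sqrt(x), by the midpoint rule."""
    h = 1.0 / n_grid
    return h * sum(normal_pdf(y - math.sqrt((i + 0.5) * h), sigma) for i in range(n_grid))


def density_via_convolution(y, sigma, n_grid=20000):
    """(phi_sigma * f0)(y) = int_0^1 phi_sigma(y - t) * 2t dt, by the midpoint rule."""
    h = 1.0 / n_grid
    total = 0.0
    for i in range(n_grid):
        t = (i + 0.5) * h
        total += normal_pdf(y - t, sigma) * 2.0 * t
    return h * total
```

Up to quadrature error, the two functions should agree at every `y`, which is the content of the second equality in (4.4) for this particular choice of $f_0$.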

Suppose the prior $\Pi_\mu$ on $\mu$ has full sup-norm support on C[0, 1] so that $\Pr(\|\mu - \mu^*\|_\infty < 
\epsilon) > 0$ for any $\epsilon > 0$ and $\mu^* \in C[0, 1]$, and the prior $\Pi_\sigma$ on $\sigma$ has full support on $[0, \infty)$. 
If $f_0$ is compactly supported so that the quantile function $\mu_0 \in C[0, 1]$, then it can 
be shown that under mild conditions, the induced prior $\Pi$ assigns positive mass to 
arbitrarily small $L_1$ neighborhoods of any density $f_0$. When $f_0$ has full support on $\mathbb{R}$, 
the quantile function $\mu_0$ is unbounded near 0 and 1, so that $\|\mu_0\|_\infty = \infty$. However, 
$\int_0^1 |\mu_0(t)|\,dt = \int_{\mathbb{R}} |x| f_0(x)\,dx$, which implies that $\mu_0$ can be identified as an element of 
$L_1[0, 1]$ if $f_0$ has finite first moment. Since C[0, 1] is dense in $L_1[0, 1]$, the previous 
conclusion regarding $L_1$ support can be shown to hold in the non-compact case too. We 
summarize the above discussion in the following theorem, with a proof provided in the 
appendix. 

Theorem 4.1. If $\Pi_\mu$ has full sup-norm support on C[0, 1] and $\Pi_\sigma$ has full support on 
$[0, \infty)$, then the $L_1$ support of the induced prior $\Pi$ on $\mathcal{F}$ contains all densities $f_0$ which 
have a finite first moment and are non-zero almost everywhere on their support. 

Remark 4.1. The conditions of Theorem 4.1 are satisfied for a wide range of Gaussian 
process priors on $\mu$ (for example, a GP with a squared exponential or Matérn covariance 
kernel). 

Let $\lambda$ denote the Lebesgue measure on [0, 1], or equivalently, the U[0, 1] distribution. 
For any measurable function $\mu : [0, 1] \to \mathbb{R}$, let $\nu_\mu$ denote the induced measure on $(\mathbb{R}, \mathcal{B})$, 
with $\mathcal{B}$ denoting the Borel sigma-field on $\mathbb{R}$. Then, for any Borel measurable set $B$, 
$\nu_\mu(B) = \lambda(\mu^{-1}(B))$, where $\mu^{-1}(B) = \{x \in [0, 1] : \mu(x) \in B\}$. By the change of variable 
theorem for induced measures, 

$\int_0^1 \phi_\sigma(y - \mu(x))\,dx = \int \phi_\sigma(y - t)\,d\nu_\mu(t)$, (4.5) 

so that $f_{\mu,\sigma}$ can be expressed as a kernel mixture form as in (1.1) with mixing distribution 
$\nu_\mu$. It turns out that this mechanism of creating random distributions is very general. 
Depending on the choice of $\mu$, one can create a large variety of mixing distributions 
based on this specification. For example, if $\mu$ is a strictly monotone function, then $\nu_\mu$ is 
absolutely continuous with respect to the Lebesgue measure, while choosing $\mu$ to be a 
step function, one obtains a discrete mixing distribution. However, it is easier to place a 
prior on $\mu$ supported on the space of continuous functions C[0, 1] without further shape 
restrictions and Theorem 4.1 assures us that this specification leads to large $L_1$ support 
on the space of densities. 
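
To illustrate how a step function for the transfer function yields a discrete mixing distribution, the following sketch compares the latent-variable density with the corresponding finite mixture of normals. The particular step function, its three levels and the weights 0.3, 0.5 and 0.2 are hypothetical choices made only for this example.

```python
import math


def normal_pdf(z, sigma):
    """Density of N(0, sigma^2) evaluated at z."""
    return math.exp(-0.5 * (z / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def step_mu(x):
    """Hypothetical step transfer function: levels -1, 0.5, 2 on intervals of length 0.3, 0.5, 0.2."""
    if x < 0.3:
        return -1.0
    if x < 0.8:
        return 0.5
    return 2.0


def density_from_transfer(y, mu, sigma, n_grid=1000):
    """Midpoint-rule approximation of int_0^1 phi_sigma(y - mu(x)) dx."""
    h = 1.0 / n_grid
    return h * sum(normal_pdf(y - mu((i + 0.5) * h), sigma) for i in range(n_grid))


def density_from_mixture(y, sigma):
    """Finite normal mixture whose mixing measure is the law of step_mu(U), U ~ U(0, 1)."""
    atoms = (-1.0, 0.5, 2.0)
    weights = (0.3, 0.5, 0.2)  # Lebesgue measure of each level set of step_mu
    return sum(w * normal_pdf(y - a, sigma) for a, w in zip(atoms, weights))
```

Here the induced measure places mass equal to the length of each level set on the corresponding level, so the two densities coincide.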
5. The compact case 

We first consider the case where $f_0$ is compactly supported, i.e., there exist $-\infty < a_0 < 
b_0 < \infty$ such that $\int_{a_0}^{b_0} f_0(x)\,dx = 1$. In that case, the quantile function $F_0^{-1} : [0, 1] \to [a_0, b_0]$ 
is a continuous monotone function inheriting the smoothness of $f_0$. Denote the quantile 
function by $\mu_0$. Assumption 3.1 ensures that the compactly supported density decays 
smoothly at the boundaries. Under Assumption 3.1 and the fundamental theorem of 
calculus, $\mu_0 : [0, 1] \to [a_0, b_0]$ is thrice continuously differentiable, implying $\mu_0 \in C^3[0, 1]$. 

5.1. Prior specification 

We now mention our choices for the prior distributions $\Pi_\mu$ and $\Pi_\sigma$. 

Assumption 5.1. We assume $\mu$ follows a centered Gaussian process denoted by GP(0, c), 
with a squared exponential covariance kernel $c(\cdot, \cdot; A)$ and a Gamma prior for the 
inverse-bandwidth $A$. Thus $c(t, s; A) = e^{-A(t-s)^2}$, $t, s \in [0, 1]$, $A \sim \mathrm{Ga}(p, q)$. 

Assumption 5.2. We assume $\sigma \sim \mathrm{IG}(a_\sigma, b_\sigma)$. 

Note that contrary to the usual conjugate choice of an inverse-Gamma prior for $\sigma^2$, we 
have assumed an inverse-Gamma prior for $\sigma$. This enables one to have slightly more prior 
mass near zero compared to an inverse-Gamma prior for $\sigma^2$, leading to the optimal rate 
of posterior convergence. Refer also to Kruijer, Rousseau and van der Vaart (2010) for 
a similar prior choice for the bandwidth of the kernel in discrete location-scale mixture 
priors for densities. 

5.2. Posterior convergence rate for the compact case 

We state below the main theorem of posterior convergence rates. 

Theorem 5.1. If $f_0$ satisfies Assumption 3.1 and the priors $\Pi_\mu$ and $\Pi_\sigma$ are as in 
Assumptions 5.1 and 5.2 respectively, the best obtainable rate of posterior convergence 
relative to $h$ is 

$\epsilon_n = n^{-2/5} \log n$. (5.1) 

The proof of Theorem 5.1 is based on Ghosal and van der Vaart (2007) and 
van der Vaart and van Zanten (2007, 2008b, 2009). Unlike the treatment in discrete mixture 
models (Ghosal and van der Vaart, 2007) where a compactly supported density is 
approximated with a discrete mixture of normals, the main trick here is to approximate 
the true density $f_0$ by the convolution $\phi_\sigma * f_0$ and allow the prior on the transfer function 
to appropriately concentrate around the true quantile function $\mu_0 \in C[0, 1]$. 

To guarantee that the above scheme leads to the optimal rate of convergence, we first 
derive sharp bounds for the Hellinger distance between $f_{\mu_1,\sigma_1}$ and $f_{\mu_2,\sigma_2}$ for $\mu_1, \mu_2 \in 
C[0, 1]$ and $\sigma_1, \sigma_2 > 0$. We summarize the result in the following Lemma 5.1. 



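Lemma 5.1. For $\mu_1, \mu_2 \in C[0, 1]$ and $\sigma_1, \sigma_2 > 0$, 

$h^2(f_{\mu_1,\sigma_1}, f_{\mu_2,\sigma_2}) \leq 1 - \sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2}} \exp\left\{-\frac{\|\mu_1 - \mu_2\|_\infty^2}{4(\sigma_1^2 + \sigma_2^2)}\right\}$. (5.2) 
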

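Proof. Note that by Hölder's inequality, 

$f_{\mu_1,\sigma_1}(y)\, f_{\mu_2,\sigma_2}(y) \geq \left\{\int_0^1 \sqrt{\phi_{\sigma_1}(y - \mu_1(x))} \sqrt{\phi_{\sigma_2}(y - \mu_2(x))}\,dx\right\}^2$. 

Hence, 

$h^2(f_{\mu_1,\sigma_1}, f_{\mu_2,\sigma_2}) \leq 1 - \int \int_0^1 \sqrt{\phi_{\sigma_1}(y - \mu_1(x))} \sqrt{\phi_{\sigma_2}(y - \mu_2(x))}\,dx\,dy$. 

By changing the order of integration (applying Fubini's theorem since the function within 
the integral is jointly integrable) we get, 

$h^2(f_{\mu_1,\sigma_1}, f_{\mu_2,\sigma_2}) \leq \int_0^1 h^2(f_{\mu_1(x),\sigma_1}, f_{\mu_2(x),\sigma_2})\,dx$ 
$= \int_0^1 \left[1 - \sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2}} \exp\left\{-\frac{(\mu_1(x) - \mu_2(x))^2}{4(\sigma_1^2 + \sigma_2^2)}\right\}\right] dx$ 
$\leq 1 - \sqrt{\frac{2\sigma_1\sigma_2}{\sigma_1^2 + \sigma_2^2}} \exp\left\{-\frac{\|\mu_1 - \mu_2\|_\infty^2}{4(\sigma_1^2 + \sigma_2^2)}\right\}$. 

□ 
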



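Remark 5.1. When $\sigma_1 = \sigma_2 = \sigma$, $h^2(f_{\mu_1,\sigma}, f_{\mu_2,\sigma}) \leq 1 - \exp\{-\|\mu_1 - \mu_2\|_\infty^2 / (8\sigma^2)\}$, 
which implies that $h^2(f_{\mu_1,\sigma}, f_{\mu_2,\sigma}) \lesssim \|\mu_1 - \mu_2\|_\infty^2 / \sigma^2$. 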
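
The argument behind Lemma 5.1 amounts to a lower bound on the Hellinger affinity $\int \sqrt{f_{\mu_1,\sigma_1} f_{\mu_2,\sigma_2}}$ in terms of $\|\mu_1 - \mu_2\|_\infty$, $\sigma_1$ and $\sigma_2$. The sketch below checks this lower bound numerically for two hypothetical transfer functions. The functions `mu1` and `mu2`, the integration range and the grid sizes are assumptions made for illustration.

```python
import math


def normal_pdf(z, sigma):
    """Density of N(0, sigma^2) evaluated at z."""
    return math.exp(-0.5 * (z / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def nllvm_density(y, mu, sigma, n_grid=200):
    """Midpoint-rule approximation of int_0^1 phi_sigma(y - mu(x)) dx."""
    h = 1.0 / n_grid
    return h * sum(normal_pdf(y - mu((i + 0.5) * h), sigma) for i in range(n_grid))


def hellinger_affinity(mu1, s1, mu2, s2, lo=-5.0, hi=6.0, n_y=550):
    """Midpoint-rule approximation of int sqrt(f_{mu1,s1}(y) f_{mu2,s2}(y)) dy."""
    step = (hi - lo) / n_y
    total = 0.0
    for j in range(n_y):
        y = lo + (j + 0.5) * step
        total += math.sqrt(nllvm_density(y, mu1, s1) * nllvm_density(y, mu2, s2))
    return step * total


def affinity_lower_bound(sup_diff, s1, s2):
    """sqrt(2 s1 s2 / (s1^2 + s2^2)) * exp(-sup_diff^2 / (4 (s1^2 + s2^2)))."""
    v = s1 ** 2 + s2 ** 2
    return math.sqrt(2.0 * s1 * s2 / v) * math.exp(-sup_diff ** 2 / (4.0 * v))


def mu1(x):
    """Hypothetical transfer function."""
    return x


def mu2(x):
    """Hypothetical perturbation of mu1 with sup-norm distance 0.3."""
    return x + 0.3 * math.sin(2.0 * math.pi * x)
```

For these choices, `hellinger_affinity(mu1, 0.4, mu2, 0.5)` should be no smaller than `affinity_lower_bound(0.3, 0.4, 0.5)`, which is the inequality that drives (5.2).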

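Remark 5.2. Note that if we had used $h^2(f_{\mu_1,\sigma_1}, f_{\mu_2,\sigma_2}) \leq \|f_{\mu_1,\sigma_1} - f_{\mu_2,\sigma_2}\|_1$, we would 
have obtained the cruder bound 

$h^2(f_{\mu_1,\sigma_1}, f_{\mu_2,\sigma_2}) \leq C_1 \frac{\|\mu_1 - \mu_2\|_\infty}{(\sigma_1 \wedge \sigma_2)} + C_2 \frac{|\sigma_2 - \sigma_1|}{(\sigma_1 \wedge \sigma_2)}$ 

for some constants $C_1, C_2 > 0$, which is linear in $\|\mu_1 - \mu_2\|_\infty$. This bound is less sharp 
than what is obtained in Lemma 5.1 and does not suffice for obtaining the optimal rate 
of convergence. 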

To control the Kullback-Leibler distance between the true density $f_0$ and the model 
$f_{\mu,\sigma}$, we derive an upper bound for $\log \|f_0 / f_{\mu,\sigma}\|_\infty$ in Lemma 5.2. 

Lemma 5.2. If $f_0$ satisfies Assumption 3.2, 

$\log \left\|\frac{f_0}{f_{\mu,\sigma}}\right\|_\infty \leq C_6 + \frac{\|\mu - \mu_0\|_\infty^2}{\sigma^2}$ (5.3) 

for some constant $C_6 > 0$. 

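Proof. Note that 

$f_{\mu,\sigma}(y) = \frac{1}{\sqrt{2\pi}\sigma} \int_0^1 \exp\left\{-\frac{(y - \mu(x))^2}{2\sigma^2}\right\} dx$ 
$\geq \frac{1}{\sqrt{2\pi}\sigma} \int_0^1 \exp\left\{-\frac{(y - \mu_0(x))^2}{\sigma^2}\right\} dx\, \exp\left\{-\frac{\|\mu - \mu_0\|_\infty^2}{\sigma^2}\right\}$ 
$\geq C_5 f_0(y) \exp\left\{-\frac{\|\mu - \mu_0\|_\infty^2}{\sigma^2}\right\}$, 

where the first inequality uses $(y - \mu(x))^2 \leq 2(y - \mu_0(x))^2 + 2\|\mu - \mu_0\|_\infty^2$ and 
the last inequality follows from Lemma 6 of Ghosal and van der Vaart (2007) by 
Assumption 3.2. Hence $\log \|f_0 / f_{\mu,\sigma}\|_\infty \leq C_6 + \|\mu - \mu_0\|_\infty^2 / \sigma^2$ 
for some constant $C_6 > 0$. 

□ 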



Remark 5.3. Note that if $f_0$ is compactly supported then Assumption 3.2 is automatically satisfied. 

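Proof of Theorem 5.1: Following Ghosal, Ghosh and van der Vaart (2000), we need 
to find sequences $\bar\epsilon_n, \tilde\epsilon_n \to 0$ with $n \min\{\bar\epsilon_n^2, \tilde\epsilon_n^2\} \to \infty$ such that there exist constants 
$C_1, C_2, C_3, C_4 > 0$ and sets $\mathcal{F}_n \subset \mathcal{F}$ so that, 

$\log N(\bar\epsilon_n, \mathcal{F}_n, d) \leq C_1 n \bar\epsilon_n^2$ (5.4) 

$\Pi(\mathcal{F}_n^c) \leq C_3 \exp\{-n \tilde\epsilon_n^2 (C_2 + 4)\}$ (5.5) 

$\Pi\left(f_{\mu,\sigma} : \int f_0 \log \frac{f_0}{f_{\mu,\sigma}} \leq \tilde\epsilon_n^2,\ \int f_0 \log\left(\frac{f_0}{f_{\mu,\sigma}}\right)^2 \leq \tilde\epsilon_n^2\right) \geq C_4 \exp\{-C_2 n \tilde\epsilon_n^2\}$. (5.6) 

Then we can conclude that for $\epsilon_n = \max\{\bar\epsilon_n, \tilde\epsilon_n\}$ and sufficiently large $M > 0$, the 
posterior probability 

$\Pi_n(f_{\mu,\sigma} : d(f_{\mu,\sigma}, f_0) > M \epsilon_n \mid Y_1, \ldots, Y_n) \to 0$ a.s. $P_{f_0}$. 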

Let $W = (W_t : t \in \mathbb{R})$ be a Gaussian process with squared exponential covariance 
kernel. The spectral measure $m_w$ of $W$ is absolutely continuous with respect to the 
Lebesgue measure $\lambda$ on $\mathbb{R}$ with the Radon-Nikodym derivative given by 

$\frac{dm_w}{d\lambda}(x) = \frac{1}{2\pi^{1/2}} e^{-x^2/4}$. 

Define a scaled Gaussian process $W^a = (W_{at} : t \in [0, 1])$, viewed as a map in C[0, 1]. 
Let $\mathbb{H}^a$ denote the RKHS of $W^a$, with the corresponding norm $\|\cdot\|_{\mathbb{H}^a}$. The unit ball in 
the RKHS is denoted $\mathbb{H}_1^a$. We will consider the Gaussian process $\mu \sim W^A$ given $A$, with 
$A \sim \mathrm{Gamma}(p, q)$. 

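We will first verify (5.6) along the lines of Ghosal and van der Vaart (2007). Note that 

$h^2(f_0, f_{\mu,\sigma}) \lesssim h^2(f_0, f_{\mu_0,\sigma}) + h^2(f_{\mu_0,\sigma}, f_{\mu,\sigma})$. (5.7) 

Since $f_{\mu_0,\sigma} = \phi_\sigma * f_0$, using Lemma 4 of Ghosal and van der Vaart (2007), one obtains 
under Assumptions 3.1 and 3.2, 

$h^2(f_0, f_{\mu_0,\sigma}) \lesssim O(\sigma^4)$. (5.8) 

From Lemma 5.1 and the following remark, we obtain 

$h^2(f_{\mu_0,\sigma}, f_{\mu,\sigma}) \lesssim \frac{\|\mu - \mu_0\|_\infty^2}{\sigma^2}$. (5.9) 

From Lemma 8 of Ghosal and van der Vaart (2007), one has 

$\int f_0 \log\left(\frac{f_0}{f_{\mu,\sigma}}\right)^i \lesssim h^2(f_0, f_{\mu,\sigma}) \left[1 + \log \left\|\frac{f_0}{f_{\mu,\sigma}}\right\|_\infty\right]^i$ (5.10) 

for $i = 1, 2$. 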

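From (5.7)-(5.10), for any $b > 1$ and $\tilde\epsilon_n^2 = \sigma_n^4$, 

$\left\{\sigma \in [\sigma_n, \sigma_n + \sigma_n^b],\ \|\mu - \mu_0\|_\infty \lesssim \sigma_n^3\right\} \subset \left\{\int f_0 \log \frac{f_0}{f_{\mu,\sigma}} \lesssim \sigma_n^4,\ \int f_0 \log\left(\frac{f_0}{f_{\mu,\sigma}}\right)^2 \lesssim \sigma_n^4\right\}$. 

Then (5.6) will be satisfied with $\tilde\epsilon_n = n^{-2/5}$ if 

$P\{\sigma \in [\sigma_n, 2\sigma_n],\ \|\mu - \mu_0\|_\infty \lesssim \sigma_n^3\} \geq \exp\{-C_4 n^{1/5}\}$ 

for some constant $C_4 > 0$. 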

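Since $\mu_0 \in C^3[0, 1]$, from Section 5.1 of van der Vaart and van Zanten (2009), 

$P(\|\mu - \mu_0\|_\infty \leq 2\delta_n) \geq C_5 \exp\{-C_6 (1/\delta_n)^{1/3}\} (C_7/\delta_n)^{p/3}$, 

for $\delta_n \to 0$ and constants $C_5, C_6, C_7 > 0$. Letting $\delta_n = \sigma_n^3$, we obtain 

$P(\|\mu - \mu_0\|_\infty \leq 2\delta_n) \geq \exp\{-C_8 (1/\sigma_n)\}$, 

for some constant $C_8 > 0$. Since $\sigma \sim \mathrm{IG}(a_\sigma, b_\sigma)$, we have 

$P(\sigma \in [\sigma_n, 2\sigma_n]) = \frac{b_\sigma^{a_\sigma}}{\Gamma(a_\sigma)} \int_{\sigma_n}^{2\sigma_n} x^{-(a_\sigma + 1)} e^{-b_\sigma/x}\,dx$ 
$\geq \frac{b_\sigma^{a_\sigma}}{\Gamma(a_\sigma)} \int_{\sigma_n}^{2\sigma_n} e^{-2b_\sigma/x}\,dx$ 
$\geq \frac{b_\sigma^{a_\sigma}}{\Gamma(a_\sigma)} \sigma_n \exp\{-2b_\sigma/\sigma_n\}$ 
$\geq \exp\{-C_9/\sigma_n\}$, 

for some constant $C_9 > 0$. Hence 

$P\{\sigma \in [\sigma_n, 2\sigma_n],\ \|\mu - \mu_0\|_\infty \lesssim \sigma_n^3\} \geq \exp\{-C_4 n^{1/5}\}$, 

with $\sigma_n = \tilde\epsilon_n^{1/2} = n^{-1/5}$ and for some $C_4 > 0$. 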
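
The lower bound on the prior mass of $[\sigma_n, 2\sigma_n]$ can be checked in closed form when the shape parameter is an integer, since $\sigma \sim \mathrm{IG}(a_\sigma, b_\sigma)$ is equivalent to $1/\sigma$ following a Gamma distribution with shape $a_\sigma$ and rate $b_\sigma$. The sketch below assumes the hypothetical values $a_\sigma = 2$ and $b_\sigma = 1$ and evaluates $-\sigma_n \log P(\sigma \in [\sigma_n, 2\sigma_n])$, which should stay bounded as $\sigma_n \to 0$ if the mass is at least $\exp\{-C/\sigma_n\}$.

```python
import math


def erlang_survival(x, shape, rate):
    """P(X > x) for X ~ Gamma(shape, rate) with integer shape (Erlang distribution)."""
    return math.exp(-rate * x) * sum((rate * x) ** k / math.factorial(k) for k in range(shape))


def prob_sigma_in_window(s, shape, rate):
    """P(sigma in [s, 2s]) when 1/sigma ~ Gamma(shape, rate), i.e. sigma ~ IG(shape, rate)."""
    return erlang_survival(1.0 / (2.0 * s), shape, rate) - erlang_survival(1.0 / s, shape, rate)


def scaled_log_mass(s, shape=2, rate=1.0):
    """-s * log P(sigma in [s, 2s]); bounded in s if the mass is at least exp(-C / s)."""
    return -s * math.log(prob_sigma_in_window(s, shape, rate))
```

Evaluating `scaled_log_mass` at a decreasing sequence such as 0.2, 0.1, 0.05, 0.01 gives values that remain below the rate parameter, in line with the exponential lower bound.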

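Next we construct a sequence of subsets $\mathcal{F}_n$ such that (5.4) and (5.5) are satisfied with 
$\bar\epsilon_n = n^{-2/5} \log^{t_2} n$ and $\tilde\epsilon_n = n^{-2/5}$ for some global constant $t_2 > 0$. 

Letting $\mathbb{B}_1$ denote the unit ball of C[0, 1] and given positive sequences $M_n, r_n, \xi_n$, 
define 

$B_n = \left(M_n \sqrt{\frac{r_n}{\xi_n}}\, \mathbb{H}_1^{r_n} + \delta_n \mathbb{B}_1\right) \cup \left(\bigcup_{a < \xi_n} (M_n \mathbb{H}_1^a) + \delta_n \mathbb{B}_1\right)$ 

as in van der Vaart and van Zanten (2009), with $\delta_n = \bar\epsilon_n l_n / K_1$, $K_1 = 2(2/\pi)^{1/2}$, and let 
$\mathcal{F}_n = \{f_{\mu,\sigma} : \mu \in B_n,\ l_n < \sigma < h_n\}$. 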



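First we need to calculate $N(\bar\epsilon_n, \mathcal{F}_n, \|\cdot\|_1)$. Observe that for $\sigma_2 > \sigma_1 > \sigma_2/2$, 

$\|f_{\mu_1,\sigma_1} - f_{\mu_2,\sigma_2}\|_1 \leq \left(\frac{2}{\pi}\right)^{1/2} \frac{\|\mu_1 - \mu_2\|_\infty}{\sigma_1} + \frac{3(\sigma_2 - \sigma_1)}{\sigma_1}$. 

Taking $\kappa_n = \min\{\bar\epsilon_n/6, 1\}$ and $\sigma_m^n = l_n (1 + \kappa_n)^m$, $m \geq 0$, we obtain a partition of $[l_n, h_n]$ 
as $l_n = \sigma_0^n < \sigma_1^n < \cdots < \sigma_{m_n - 1}^n < h_n \leq \sigma_{m_n}^n$ with 

$m_n = \left(\log \frac{h_n}{l_n}\right) \frac{1}{\log(1 + \kappa_n)} + 1$. (5.11) 

One can show that $3(\sigma_m^n - \sigma_{m-1}^n)/\sigma_{m-1}^n = 3\kappa_n \leq \bar\epsilon_n/2$. Let $\{\mu_k^n,\ k = 1, \ldots, N(\delta_n, B_n, \|\cdot\|_\infty)\}$ be 
a $\delta_n$-net of $B_n$. Now consider the set 

$\{(\mu_k^n, \sigma_m^n) : k = 1, \ldots, N(\delta_n, B_n, \|\cdot\|_\infty),\ 0 \leq m \leq m_n\}$. (5.12) 

Then for any $f = f_{\mu,\sigma} \in \mathcal{F}_n$, we can find $(\mu_k^n, \sigma_m^n)$ such that $\|\mu - \mu_k^n\|_\infty < \delta_n$. In addition, 
if one has $\sigma \in (\sigma_{m-1}^n, \sigma_m^n]$, then 

$\|f_{\mu,\sigma} - f_{\mu_k^n,\sigma_m^n}\|_1 \leq \bar\epsilon_n$. 

Hence the set in (5.12) is an $\bar\epsilon_n$-net of $\mathcal{F}_n$ and its covering number is given by 

$m_n N(\delta_n, B_n, \|\cdot\|_\infty)$. 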
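
The geometric grid for the bandwidth used in this covering argument can be sketched as follows. The numerical values of the lower limit, the upper limit and the radius are hypothetical, and rounding the number of grid points down to an integer is an assumption made so that the grid is well defined.

```python
import math


def sigma_grid(l_n, h_n, eps):
    """Geometric grid l_n * (1 + kappa)^m, m = 0..m_n, covering [l_n, h_n].

    kappa = min(eps / 6, 1) so that the relative gap between consecutive
    grid points, multiplied by 3, is at most eps / 2.
    """
    kappa = min(eps / 6.0, 1.0)
    # Integer rounding of the grid size is an assumption of this sketch.
    m_n = math.floor(math.log(h_n / l_n) / math.log1p(kappa)) + 1
    grid = [l_n * (1.0 + kappa) ** m for m in range(m_n + 1)]
    return kappa, grid


def max_scaled_gap(grid):
    """max over m of 3 * (sigma_m - sigma_{m-1}) / sigma_{m-1}."""
    return max(3.0 * (b - a) / a for a, b in zip(grid, grid[1:]))
```

With `sigma_grid(0.1, 10.0, 0.3)`, the grid starts at the lower limit, its last point exceeds the upper limit, and `max_scaled_gap` equals three times `kappa` up to rounding error, so each bandwidth in the range lies within the required relative distance of a grid point.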

From the proof of Theorem 3.1 in van der Vaart and van Zanten (2009), for $\xi_n = \delta_n / (2\tau M_n)$ 
and for any $M_n, r_n$ with 

$M_n^{3/2} \sqrt{2\tau r_n} > 2\delta_n^{3/2}, \quad r_n > a_0, \quad M_n \sqrt{\|m_w\|} > \delta_n$, (5.13) 

we obtain 

$\log N(3\delta_n, B_n, \|\cdot\|_\infty) \leq K_2 r_n \left(\log\left(\frac{M_n^{3/2} \sqrt{2\tau r_n}}{\delta_n^{3/2}}\right)\right)^2 + 2 \log \frac{2 M_n \sqrt{\|m_w\|}}{\delta_n}$, (5.14) 

where $\tau^2 = \int x^2\, dm_w(x)$ and $\|m_w\|$ is the total variation norm of the spectral measure 
$m_w$. 

Again from the proof of Theorem 3.1 in van der Vaart and van Zanten (2009), for 
sufficiently small $\delta_n$ and $r_n > 1$ and for $M_n^2 > 16 K_3 r_n (\log(r_n/\delta_n))^2$, we have 

$P(W^A \notin B_n) \leq K_4 r_n^{p-1} \exp\{-K_5 r_n\} + \exp\{-M_n^2/8\}$ (5.15) 

for constants $K_3, K_4, K_5 > 0$. 

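Next we calculate $P(\sigma \notin [l_n, h_n])$. Observe that 

$P(\sigma \notin [l_n, h_n]) = P(\sigma^{-1} < h_n^{-1}) + P(\sigma^{-1} > l_n^{-1})$ 
$\leq \frac{b_\sigma^{a_\sigma} h_n^{-a_\sigma}}{\Gamma(a_\sigma + 1)} + \int_{l_n^{-1}}^{\infty} e^{-b_\sigma x/2}\,dx$. (5.16) 

Thus with $h_n = O(\exp\{n^{1/5}\})$, $l_n = O(n^{-1/5})$, $r_n = O(n^{1/5})$, $M_n = O(n^{1/10} \log n)$, 
(5.15) and (5.16) imply 

$\Pi(\mathcal{F}_n^c) \lesssim \exp\{-K_6 n^{1/5}\}$ 

for some constant $K_6 > 0$, guaranteeing that (5.5) is satisfied with $\tilde\epsilon_n = n^{-2/5}$. 

Also with $\bar\epsilon_n = n^{-2/5} (\log n)$, (5.13) is satisfied and it follows from (5.11) and (5.14) 
that 

$\log N(\bar\epsilon_n, \mathcal{F}_n, \|\cdot\|_1) \leq K_7 n^{1/5} (\log n)^2$ 

for some constant $K_7 > 0$. 

Hence $\max\{\bar\epsilon_n, \tilde\epsilon_n\} = n^{-2/5} \log n$. 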
6. The non-compact case 

The analysis of the non-compact case poses greater technical difficulties compared to 
the compact case, especially in verifying condition (5.6). Recall that in the compact 
case, $K(f_0, f_{\mu,\sigma}) \lesssim O(\sigma^4) + \|\mu - \mu_0\|_\infty^2 / \sigma^2$. However, if $\mathrm{supp}(f_0) = \mathbb{R}$, then the 
corresponding quantile function $\mu_0$ has $\|\mu_0\|_\infty = \infty$. This prohibits us from bounding 
$\int f_0 \log(f_0/f_{\mu,\sigma})^i$, $i = 1, 2$ using Lemma 5.2, since no prior for $\mu$ supported on C[0, 1] 
can concentrate around arbitrarily small neighborhoods of the true quantile function $\mu_0$ 
in sup-norm. Since the tail behavior of $f_0$ has a one-to-one correspondence with the 
behavior of $\mu_0$ near the boundary, we make additional assumptions on the tails of $f_0$ similar 
to (C3) in Kruijer, Rousseau and van der Vaart (2010). 

Assumption 6.1. $f_0$ has exponential tails, i.e., there exist positive constants $T, M, \tau_1, \tau_2$ 
such that 

$f_0(x) \leq M \exp(-\tau_1 |x|^{\tau_2}), \quad |x| > T$. (6.1) 

Remark 6.1. Remark 3.1 suggests that under Assumption 6.1, wq behaves like a poly- 
nomial near the tails and hence Assumption 3.1 is automatically satisfied as long as Wq 
or equivalently /o is twice continuously diffcrentiable. 

To derive concentration inequalities for the prior on fj,, it is convenient to work with 
a series prior for fi as follows: 

Assumption 6.2. For an orthonormal basis $\{\phi_j\}_{j=0}^\infty$ of $L_2[0,1]$, a sequence of scales
$\lambda_j \downarrow 0$, a fixed domain-rescaling integer $a$, a global scaling factor $b > 0$ and a truncation
level $J$, consider a prior distribution for $\mu$ given by an orthonormal series expansion

$W^J(t) = \sum_{j=0}^J \lambda_j Z_j b\phi_j(at)$, $t \in [0,1]$.

In the sequel we will choose sequences for $J$, $a$ and $b$ given by $J_n = O(n^{1/5})$,
$b_n = n^{-1/10}(\log n)^{1/2}$ and $a_n = n^\alpha$ for some $\alpha > 1$ to attain the optimal rate of convergence.

Remark 6.2. Let $W_{2,q}[0,1]$ denote the Sobolev space of $L_2[0,1]$ functions $f$ whose
weak derivative of order $q$, $D^q f$, belongs to $L_2[0,1]$. Also, for $C > 0$, denote

$W^C_{2,q}[0,1] = \{f \in L_2[0,1] : \|f\|_{2,q} = \|D^q f\|_2 \le C\}$ (6.2)

to be the set of functions in $W_{2,q}[0,1]$ norm-bounded by $C$. In the sequel, we shall assume
that the $\phi_j$'s are given by a cosine basis

$\phi_0(t) = 1$, (6.3)

$\phi_j(t) = \cos(2\pi jt)$, $j \ge 1$, (6.4)

so that for $f(t) = \sum_{j=0}^\infty b f_j\phi_j(at)$, one has $\|f\|_{2,q}^2 = b^2\sum_{j=1}^\infty f_j^2(2\pi aj)^q$. The techniques
used subsequently can be easily extended to other orthonormal bases.
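As an illustration of the series prior in Assumption 6.2 with the cosine basis of Remark 6.2, the following minimal sketch draws one sample path on a grid. It assumes that the $Z_j$ are i.i.d. standard normal, uses the scales $\lambda_j = j^{-q/4}$ for $j \ge 1$ adopted later in this section, and sets $\lambda_0 = 1$; the function names and the numerical values of $J$, $a$, $b$, $q$ are hypothetical choices made only for this example.

```python
import math
import random

def cosine_basis(j, t):
    """phi_0 = 1 and phi_j(t) = cos(2*pi*j*t) for j >= 1 (Remark 6.2)."""
    return 1.0 if j == 0 else math.cos(2.0 * math.pi * j * t)

def scales(J, q):
    """lambda_j = j**(-q/4) for j >= 1; lambda_0 = 1 is an assumption of this sketch."""
    return [1.0] + [j ** (-q / 4.0) for j in range(1, J + 1)]

def sample_path(J, a, b, q, grid, rng):
    """Draw W^J(t) = sum_j lambda_j * Z_j * b * phi_j(a*t) on a grid of t values."""
    lam = scales(J, q)
    z = [rng.gauss(0.0, 1.0) for _ in range(J + 1)]  # Z_j assumed i.i.d. N(0, 1)
    path = [
        b * sum(lam[j] * z[j] * cosine_basis(j, a * t) for j in range(J + 1))
        for t in grid
    ]
    return path, z, lam

J, a, b, q = 10, 3, 0.5, 20          # illustrative values only
rng = random.Random(0)
grid = [i / 200.0 for i in range(201)]
path, z, lam = sample_path(J, a, b, q, grid, rng)
sup_norm = max(abs(v) for v in path)
# Since |phi_j| <= 1, every path satisfies |W^J(t)| <= b * sum_j lambda_j * |Z_j|.
envelope = b * sum(l * abs(zj) for l, zj in zip(lam, z))
print(sup_norm <= envelope + 1e-12)
```

The deterministic envelope $b\sum_j \lambda_j|Z_j|$ checked at the end is the same quantity that drives the small-ball calculation in the proof of Lemma 6.3.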

We are now in a position to state the main theorem of posterior convergence rates for
the non-compact case.

Theorem 6.1. If $f_0$ is twice continuously differentiable, satisfies Assumptions 3.2 and
6.1, and the priors $\Pi_\mu$ and $\Pi_\sigma$ are as in Assumptions 6.2 and 5.2 respectively, the best
obtainable posterior rate of convergence relative to $h$ is

$\epsilon_n = n^{-2/5}(\log n)^{t_0}$, (6.6)

for some global constant $t_0$.



The construction of the sieves $\mathcal{F}_n$ is similar to the compact case and we shall omit
the details of calculating the entropy and complement probability of $\mathcal{F}_n$ as they are
essentially similar to the proof of Theorem 5.1. Verifying the KL condition in (5.6) is the
biggest hurdle in the non-compact case; we briefly outline the steps needed to bound the
integrals within parentheses in (5.6). The basic idea is to separate the integrals into an
integral over a compact set and its complement. Inside the compact set, one can replace
$f_0$ by a compact approximation $f_{0\sigma}$ and approximate the quantile function $\mu_{0\sigma}$ of $f_{0\sigma}$
by an infinitely smooth function on $[0,1]$ in an appropriate sense, which enables one to
obtain the right concentration rate using a smooth prior on $C[0,1]$. The complement
term can be handled by exploiting the exponential tails of $f_0$.

To elaborate, first define sets $E_\sigma = \{x : f_0(x) \ge \sigma^{H_1}\}$, $E'_\sigma = \{x : f_0(x) \ge \sigma^{H_2}\}$.
Clearly, $E_\sigma \subset E'_\sigma$ if $H_2 > H_1$. Without loss of generality, one can assume $E'_\sigma = [d_\sigma, e_\sigma]$
by Assumption 3.2. Let $g_{0\sigma} = f_0 1_{E'_\sigma}$ denote the restriction of $f_0$ to the compact set
$E'_\sigma$ and let $f_{0\sigma}$ be $g_{0\sigma}$ normalized to make it a density supported on $E'_\sigma$. Further, let
$\mu_{0\sigma} : [0,1] \to E'_\sigma$ denote the quantile function of $f_{0\sigma}$ and denote $f_{\mu_{0\sigma},\sigma} = \phi_\sigma * f_{0\sigma}$.

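We now bound $V(f_0, f_{\mu,\sigma}) = \int f_0 \log(f_0/f_{\mu,\sigma})^2$; the treatment of $KL(f_0, f_{\mu,\sigma})$ follows
similarly. To start with, observe that

$\int f_0 \left(\log\frac{f_0}{f_{\mu,\sigma}}\right)^2 \lesssim \int f_0 \left(\log\frac{f_0}{f_{\mu_0,\sigma}}\right)^2 + \int f_0 \left(\log\frac{f_{\mu_0,\sigma}}{f_{\mu,\sigma}}\right)^2$. (6.7)

Using $f_{\mu_0,\sigma} = \phi_\sigma * f_0$, it follows from Lemma 8 of Ghosal and van der Vaart (2007) that

$\int f_0 \left(\log\frac{f_0}{f_{\mu_0,\sigma}}\right)^2 \lesssim h^2(f_0, \phi_\sigma * f_0)\left(1 + \log\left\|\frac{f_0}{\phi_\sigma * f_0}\right\|_\infty\right)^2$.

Since $h^2(f_0, \phi_\sigma * f_0) \lesssim O(\sigma^4)$ from (5.8) and $\phi_\sigma * f_0 \ge C f_0$ by Assumption 3.2, one has
$\int f_0 \log(f_0/f_{\mu_0,\sigma})^2 \lesssim O(\sigma^4)$. To handle the second term in (6.7), we break up the integral
into integrals over $E_\sigma$ and $E_\sigma^c$ and further decompose the first term to obtain

$\int f_0 \left(\log\frac{f_{\mu_0,\sigma}}{f_{\mu,\sigma}}\right)^2 \lesssim \left\{\int_{E_\sigma} f_0 \left(\log\frac{f_{\mu_0,\sigma}}{f_{\mu_{0\sigma},\sigma}}\right)^2 + \int_{E_\sigma} f_0 \left(\log\frac{f_{\mu_{0\sigma},\sigma}}{f_{m_\sigma,\sigma}}\right)^2 + \int_{E_\sigma} f_0 \left(\log\frac{f_{m_\sigma,\sigma}}{f_{\mu,\sigma}}\right)^2\right\} + \int_{E_\sigma^c} f_0 \left(\log\frac{f_{\mu_0,\sigma}}{f_{\mu,\sigma}}\right)^2$. (6.8)

As mentioned before, we work with a compactly supported approximation $f_{0\sigma}$ of $f_0$ on
$E'_\sigma$, with the support $E'_\sigma$ of $f_{0\sigma}$ containing $E_\sigma$, and exploit the exponentially decaying
tails of $f_0$ on $E_\sigma^c$. Here $m_\sigma$ in (6.8) is an infinitely smooth function whose choice will be made
explicit later. We now provide a detailed analysis of each term in (6.8).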

We start with the last term on the right hand side of (6.8). The main idea is to work
on a norm-bounded subset of the function space where the density function $f_{\mu,\sigma}$ can be
bounded below and utilize the sub-Gaussian tails of $\|\mu\|_\infty$ to bound the integral outside
the above region. Observe that for $\|\mu\|_\infty \le M$,

which implies 



< log C±a J^f ( y )dy + ^ J^iy-Mfdy 

= (kgCfca + ^s) J ec fo(y)dy + ±s y 2 fo(y)dy - £ yf 



(y)dy. 



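Now since $\int_{E_\sigma^c} y^j f_0(y)\,dy \le \sigma^{H_1/2}\int_{E_\sigma^c} y^j\sqrt{f_0(y)}\,dy$, $j = 0, 1, 2$, we need to choose $H_1$
satisfying $\sigma^{H_1/2} \lesssim \sigma^6/M$ to make $\int_{E_\sigma^c} f_0(y)\log\{f_{\mu_0,\sigma}(y)/f_{\mu,\sigma}(y)\}\,dy \le O(\sigma^4)$.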

To bound the integral over the set $\{\|\mu\|_\infty > M\}$, we provide an upper bound to
$P(\|W^J\|_\infty > M)$ in the following Lemma 6.1, with proof in the Appendix.

Lemma 6.1. With $\sigma_J^2 = \sum_{j=1}^J \lambda_j^2$,

$P(\|W^J\|_\infty > M) \le 2aM\exp\left[-\frac{1}{2b^2\sigma_J^2}\left\{M - \frac{C_6}{M}\left((\log a)^{1/2} + (\log M)^{1/2}\right)\right\}^2\right]$

for some constant $C_6 > 0$.

We now consider the three terms inside the parentheses in (6.8). Let us start with
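$\int_{E_\sigma} f_0 \log(f_{\mu_0,\sigma}/f_{\mu_{0\sigma},\sigma})^2$. For $y \in E_\sigma$, $f_{\mu_0,\sigma}(y)/f_{\mu_{0\sigma},\sigma}(y) = \phi_\sigma * f_0(y)/\phi_\sigma * f_{0\sigma}(y)$. Recall
$f_{0\sigma}(y) = f_0 1_{E'_\sigma}(y)/t_\sigma$ with $t_\sigma = \int_{E'_\sigma} f_0(y)\,dy$. Note that

$t_\sigma \ge 1 - \int_{(E'_\sigma)^c} f_0(x)\,dx \ge 1 - \sigma^{H_2/2}\int\sqrt{f_0(x)}\,dx \ge 1 - \sigma^4$

for $H_2 > 8$. Now,

$\phi_\sigma * f_0(y) = \int_{E'_\sigma}\phi_\sigma(y - t)f_0(t)\,dt + \int_{(E'_\sigma)^c}\phi_\sigma(y - t)f_0(t)\,dt$.

Hence,

$\frac{\phi_\sigma * f_0(y)}{\phi_\sigma * f_{0\sigma}(y)} = t_\sigma\left\{1 + \frac{\int_{(E'_\sigma)^c}\phi_\sigma(y - t)f_0(t)\,dt}{\int_{E'_\sigma}\phi_\sigma(y - t)f_0(t)\,dt}\right\}$.

Now, for $t \in (E'_\sigma)^c$ and $y \in E_\sigma$, $f_0(t) \le \sigma^{H_2} \le \sigma^{H_2 - H_1}f_0(y)$, implying
$\int_{(E'_\sigma)^c}\phi_\sigma(y - t)f_0(t)\,dt \le \sigma^{H_2 - H_1}f_0(y)$. Moreover,

$\int_{E'_\sigma}\phi_\sigma(y - t)f_0(t)\,dt = \phi_\sigma * f_0(y) - \int_{(E'_\sigma)^c}\phi_\sigma(y - t)f_0(t)\,dt \ge Cf_0(y) - \sigma^{H_2 - H_1}f_0(y)$.

Thus,

$\frac{\phi_\sigma * f_0(y)}{\phi_\sigma * f_{0\sigma}(y)} \le 1 + \frac{\sigma^{H_2 - H_1}}{C - \sigma^{H_2 - H_1}}$. (6.9)

On the other hand,

$\frac{\phi_\sigma * f_0(y)}{\phi_\sigma * f_{0\sigma}(y)} = t_\sigma\frac{\phi_\sigma * f_0(y)}{\int_{E'_\sigma}\phi_\sigma(y - t)f_0(t)\,dt} \ge t_\sigma = 1 + O(\sigma^4)$. (6.10)

Hence, from (6.9) and (6.10), one has $\phi_\sigma * f_0(y)/\phi_\sigma * f_{0\sigma}(y) = 1 + O(\sigma^4)$ for $y \in E_\sigma$, implying
$\int_{E_\sigma} f_0 \log(\phi_\sigma * f_0/\phi_\sigma * f_{0\sigma})^2 = O(\sigma^4)$.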

We next turn our attention to $\int_{E_\sigma} f_0 \log(f_{\mu_{0\sigma},\sigma}/f_{m_\sigma,\sigma})^2$. Ghosal and van der Vaart
(2007) showed that a Gaussian convolution of a compactly supported distribution can
be approximated with high accuracy by a finite mixture of normals with "relatively few"
mixture components, and Kruijer, Rousseau and van der Vaart (2010) obtained a finer
calibration of their result to handle the above integral. However, it becomes unwieldy
to use their result in our setup since their approximating density is obtained as the
convolution of a Gaussian kernel with a discrete mixing distribution with finitely many
support points, with the corresponding quantile function being a step function on $[0,1]$
with finitely many jumps. Although one can place a prior on $\mu$ whose realizations are
step functions, several issues arise with the posterior computation, including choosing and
updating the number of steps. It would be appealing to use a smooth prior on $\mu$ and yet
obtain a similar approximation result. We borrow techniques from the physics literature
on maximum entropy moment matching or MAXENT (Mead and Papanicolaou, 1984)
to develop an approximation result with a smooth mixing measure as follows:

Lemma 6.2. Let $f$ be a density compactly supported on $[-a_\sigma, a_\sigma]$ with $a_\sigma = a_0|\log\sigma|^{1/\tau_2}$
with $\sigma$ small enough. Then, for any $A_0 > 0$, there exists an infinitely smooth density $f_{m_\sigma}$
on $[-a_\sigma, a_\sigma]$, such that $\|\phi_\sigma * f - \phi_\sigma * f_{m_\sigma}\|_\infty \lesssim \sigma^{-1}\exp(-C|\log\sigma|^{1/\tau_2})$.

A proof of Lemma 6.2 can be found in the Appendix.

The tail behavior of $f_0$ implied by Assumptions 3.2 and 6.1 implies $E'_\sigma \subset [-a_\sigma, a_\sigma]$ with
$a_\sigma = a_0|\log\sigma|^{1/\tau_2}$ with $a_0 = (H_2/\tau_1)^{1/\tau_2}$.
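To make the inclusion of the level set $E'_\sigma$ in $[-a_\sigma, a_\sigma]$ concrete, here is a small sketch for one density with exponential tails, the standard Laplace density $f_0(x) = \frac{1}{2}e^{-|x|}$, for which Assumption 6.1 holds with $M = 1/2$ and $\tau_1 = \tau_2 = 1$. The Laplace example, the function names and the value $H_2 = 12$ are choices made for this illustration; the constant $a_0 = (H_2/\tau_1)^{1/\tau_2}$ is our reading of the text and should be treated as an assumption.

```python
import math

def laplace_density(x):
    """f_0(x) = 0.5 * exp(-|x|): exponential tails with M = 1/2, tau_1 = tau_2 = 1."""
    return 0.5 * math.exp(-abs(x))

def level_set_endpoint(sigma, H2):
    """Right endpoint e of E'_sigma = {x : f_0(x) >= sigma**H2} = [-e, e], for sigma < 1."""
    return H2 * abs(math.log(sigma)) - math.log(2.0)

def a_sigma(sigma, H2, tau1=1.0, tau2=1.0):
    """a_sigma = a_0 * |log sigma|**(1/tau2), with a_0 = (H2/tau1)**(1/tau2) assumed."""
    a0 = (H2 / tau1) ** (1.0 / tau2)
    return a0 * abs(math.log(sigma)) ** (1.0 / tau2)

H2 = 12
for sigma in (0.5, 0.1, 0.01):
    e = level_set_endpoint(sigma, H2)
    # The level set [-e, e] should sit inside [-a_sigma, a_sigma].
    print(sigma, round(e, 3), round(a_sigma(sigma, H2), 3), e <= a_sigma(sigma, H2))
```

In this example the level set grows only logarithmically as $\sigma \to 0$, which is what keeps the compact approximations $f_{0\sigma}$ manageable.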

Let $m_\sigma$ be the quantile function of the compactly supported density $f_{m_\sigma}$ in Lemma
6.2 and let $f_{m_\sigma,\sigma} = \phi_\sigma * f_{m_\sigma}$. Note that for $y \in E'_\sigma$,



Uo, Av) = T~ I My-t)fo(t)dt 

4>a J EL 



L — (7 I 

for sufficiently small $\sigma$. Using $|\log x| \le \max\{|x - 1|, |1/x - 1|\}$ for $x > 0$, one gets

2 



E„ 



/dog ^ < fa 



kJ -Je/ V 

for $A_0$ large enough. By choosing $A_0$ sufficiently large and using Lemma 6.2, we obtain



||//joct,o" fm a , cr || , 



(C/3)a 



H 2 



0(a 4 ). (6.11) 



Finally, we consider the third term $\int_{E_\sigma} f_0 \log(f_{m_\sigma,\sigma}/f_{\mu,\sigma})^2$ inside the parentheses in
(6.8). Proceeding as in the previous case, we first lower bound $f_{m_\sigma,\sigma}$ on $E'_\sigma$. In the
previous case, we already obtained $f_{\mu_{0\sigma},\sigma} \gtrsim \sigma^{H_2}$. From Borwein and Lewis (1991),
$\|f_{m_\sigma} - f_{0\sigma}\|_\infty = o(k^{-1})$ if we match $k$ moments. From Lemma 6.2, we know that
$k \approx O(\sigma^{-\alpha}|\log\sigma|^{\alpha/\tau_2})$ and hence by choosing $\alpha$ large enough, we can make $f_{m_\sigma,\sigma} \gtrsim \sigma^{H_2}$
on $E'_\sigma$. Now, using the same bound for $|\log x|$ as before, we need $\|f_{m_\sigma,\sigma} - f_{\mu,\sigma}\|_\infty$ to be
$O(\sigma^H)$ for any $H > 0$. From Kruijer, Rousseau and van der Vaart (2010) it follows that
$\sup_y|\phi_\sigma(y - \mu_1) - \phi_\sigma(y - \mu_2)| \lesssim |\mu_1 - \mu_2|/\sigma^2$, so that

$\|f_{m_\sigma,\sigma} - f_{\mu,\sigma}\|_\infty \lesssim \frac{\|\mu - m_\sigma\|_\infty}{\sigma^2}$. (6.12)

We shall now exploit the infinite differentiability of $m_\sigma$ to view it as an element of $W^C_{2,q}$
for some large $q$ and calculate the probability of the Gaussian process $W^J$ concentrating
around $m_\sigma$.

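The reproducing kernel Hilbert space (RKHS) of $W^J$ consists of the set of functions
$w(t) = \sum_{j=0}^J w_j b\phi_j(at)$, $t \in [0,1]$, with RKHS norm

$\|w\|_{\mathbb{H}}^2 = \sum_{j=0}^J \frac{w_j^2}{\lambda_j^2}$.

Since $m_\sigma$ is infinitely differentiable, $m_\sigma \in W^C_{2,q}[0,1]$ for any $q > 1$ (with $C$ possibly
depending on $q$). Hence there exists $\{m_{\sigma j}\}_{j=0}^\infty$ such that

$m_\sigma(t) = \sum_{j=0}^\infty m_{\sigma j}b\phi_j(at)$, $t \in [0,1]$.

Consider the projection of $m_\sigma$ on the RKHS of $W^J$ as

$m_\sigma^J(t) = \sum_{j=0}^J m_{\sigma j}b\phi_j(at)$, $t \in [0,1]$.

In the sequel, we will choose a $q > 1$ and sequences $J_n \uparrow \infty$, $a_n$, $b_n$ to achieve
the optimal rate of convergence. To that end, we calculate the prior concentration of $W^J$
around $m_\sigma^J$ for a fixed $J$ with $\lambda_j = j^{-q/4}$ for $j \ge 1$. Recall that the prior concentration
function of the Gaussian process $W^J$ around $m_\sigma^J$ is given by

$\phi_{m_\sigma^J}(\epsilon) = \inf_{w \in \mathbb{H} : \|w - m_\sigma^J\|_\infty \le \epsilon}\|w\|_{\mathbb{H}}^2 - \log P(\|W^J\|_\infty \le \epsilon)$. (6.13)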

Lemma 6.3. For q > 16, 

A> (\< 1 II ■'ll 2 + 



For a proof, refer to the Appendix. 

Recall that we need the concentration bound for $m_\sigma$, while in the above lemma, we
obtained the concentration bound for $m_\sigma^J$. We thus need error bounds on how well the
truncation $m_\sigma^J$ approximates the function $m_\sigma$. Noting that $m_{\sigma j}^2 \le \|m_\sigma\|_{2,q}^2(aj)^{-q}$,

$\|m_\sigma - m_\sigma^J\|_\infty \le \sum_{j=J+1}^\infty |m_{\sigma j}| \le \|m_\sigma\|_{2,q}\sum_{j=J+1}^\infty (aj)^{-q/2} \le \|m_\sigma\|_{2,q}(aJ)^{-(q/2-1)}$. (6.14)



To bound the final term in (6.14), we provide an upper bound to $\|m_\sigma\|_{2,q}$ in the
following Lemma 6.4.

Lemma 6.4.

$\|m_\sigma\|_{2,q} \lesssim \sigma^{-(2q-1)H_2}$.



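Proof. Exact derivation of the bound is quite tedious and we shall only sketch the main
steps of the proof. Recall that $m_\sigma(x) = F_{m_\sigma}^{-1}(x)$, $x \in [0,1]$, where $F_{m_\sigma} : [-a_\sigma, a_\sigma] \to [0,1]$
is given by $F_{m_\sigma}(x) = \int_{-a_\sigma}^x f_{m_\sigma}(t)\,dt$. Then

$\|m_\sigma\|_{2,q}^2 = \int_0^1\{(F_{m_\sigma}^{-1})^{(q)}(x)\}^2\,dx$. (6.15)

Observe that

$(F_{m_\sigma}^{-1})'(1) = \frac{1}{f_{m_\sigma}(a_\sigma)}$, $(F_{m_\sigma}^{-1})''(1) = -\frac{f'_{m_\sigma}(a_\sigma)}{\{f_{m_\sigma}(a_\sigma)\}^3}$.

Proceeding like this, one has $\{f_{m_\sigma}(a_\sigma)\}^{2q-1}$ in the denominator for $(F_{m_\sigma}^{-1})^{(q)}(1)$ and
$\{f_{m_\sigma}(-a_\sigma)\}^{2q-1}$ for $(F_{m_\sigma}^{-1})^{(q)}(0)$. The numerator terms of the above expressions are bounded.
From Borwein and Lewis (1991), we know that $\|f_{m_\sigma} - f_{0\sigma}\|_\infty = o(k^{-1})$ if we match $k$
moments. From Lemma 6.2, $k \approx O(\sigma^{-\alpha}|\log\sigma|^{\alpha/\tau_2})$ and hence by choosing $\alpha$ large enough
we can make $f_{m_\sigma}(a_\sigma) \ge f_{0\sigma}(a_\sigma) - \sigma^{H_2+1}$ and $f_{m_\sigma}(-a_\sigma) \ge f_{0\sigma}(-a_\sigma) - \sigma^{H_2+1}$, which
implies,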



(6.16) 



Noting that $\int_0^1\{(F_{m_\sigma}^{-1})^{(q)}(x)\}^2\,dx \lesssim \max\{\{(F_{m_\sigma}^{-1})^{(q)}(0)\}^2, \{(F_{m_\sigma}^{-1})^{(q)}(1)\}^2\}$, the conclusion
follows immediately. □
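The derivative formulas used in the proof of Lemma 6.4 are instances of the inverse function rule, $(F^{-1})'(u) = 1/f(F^{-1}(u))$ and $(F^{-1})''(u) = -f'(F^{-1}(u))/\{f(F^{-1}(u))\}^3$. The sketch below checks the second identity by finite differences on a toy density, $f(x) = 2x$ on $[0,1]$, chosen only because its quantile function $\sqrt{u}$ is available in closed form; it is not one of the densities appearing in the proof.

```python
import math

def quantile(u):
    """Quantile function of the toy density f(x) = 2x on [0, 1], i.e. F(x) = x**2."""
    return math.sqrt(u)

def density(x):
    return 2.0 * x

def density_prime(x):
    return 2.0

def second_derivative_fd(g, u, h=1e-4):
    """Central finite-difference approximation to g''(u)."""
    return (g(u + h) - 2.0 * g(u) + g(u - h)) / (h * h)

u = 0.3
x = quantile(u)
closed_form = -density_prime(x) / density(x) ** 3   # -f'/f^3 evaluated at F^{-1}(u)
numeric = second_derivative_fd(quantile, u)
print(closed_form, numeric)
```

The cubic power of the density in the denominator is what makes the Sobolev norm of the quantile function blow up when the density is small near the end points of its support.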



We are now in a position to complete the proof of Theorem 6.1.

Proof. Since we can bound the numerator of the right hand side of (6.12) as

$\|W^J - m_\sigma\|_\infty \le \|W^J - m_\sigma^J\|_\infty + \|m_\sigma - m_\sigma^J\|_\infty$, (6.17)

we need $\|m_\sigma - m_\sigma^J\|_\infty = \|m_\sigma\|_{2,q}(aJ)^{-(q/2-1)}$ to be $O(\sigma^{H_2+4})$ so that the fourth term of
(6.8) is $O(\sigma^4)$. Next, we calculate the prior probability $P(\|W^J - m_\sigma\|_\infty < \sigma^{H_2+4})$. Using
Lemma 6.3, we can see that if

M n a^' 2 = 0(a 6 n ), (6.18) 
P(||^ J L >M n ) = 0(e- 1 / ff "), (6.19) 
|| 2 , 9 KJ„)-^ 2 - 1 )=0(a^+ 4 ), (6.20) 
(W" 2+4) ) 2 ° /<? = 0(O, (6.21) 
-^W Kg =0(0. (6-22) 

Un<J n 

J n = 0(0, (6.23) 



l m o-!l2, 9 



and a € [a n ,2o~ n \, then 



fo log -p- < o-i, [ f log ( -A. ) < a 4 



> P( ff 6 [<r„,2<7„], ||W - m^l^ < ^ +4 , IIWI^ < M n ) 

>0(a-- 1 |bg<r„|). 



Next we make specific choices for $a_n$, $b_n$, $M_n$ and $\sigma_n$. Clearly $\sigma_n = O(n^{-1/5})$ for the
optimal rate. (6.18)-(6.23) determine the values of the sequences $M_n$, $a_n$, $b_n$ and $q$ using the
upper bounds on $\|m_\sigma\|_{2,q}$ and $P(\|W^J\|_\infty > M_n)$ provided in Lemma 6.4 and Lemma
6.1 respectively. It can be verified that (6.20), (6.21) and (6.22) are satisfied with $q$ greater
than the positive root of the quadratic equation

$(9/10)q^2 - (10H_2 + 24/5)q - 2(2H_2 + 7) = 0$, (6.24)

which is satisfied by $q \approx 95$. Choosing $q = 150$, $H_1 = H_2 = 12$, $a_n = n^\alpha$ for some
$\alpha > 1$, $M_n^2 = O(\log n)$, $b_n = O(n^{-1/10}(\log n)^{1/2})$, we can see that the convergence rate is
$\epsilon_n = n^{-2/5}\log^{t_0} n$ for some global constant $t_0$. □
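The threshold on $q$ can be evaluated directly from the quadratic (6.24). The sketch below takes the coefficients exactly as printed in (6.24) and computes the positive root for two illustrative values of $H_2$; the function name and the choice of $H_2$ values are ours, and the calculation is only a numerical check, not part of the original argument.

```python
import math

def positive_root(H2):
    """Positive root of (9/10) q^2 - (10 H2 + 24/5) q - 2 (2 H2 + 7) = 0."""
    a = 9.0 / 10.0
    b = -(10.0 * H2 + 24.0 / 5.0)
    c = -2.0 * (2.0 * H2 + 7.0)
    # a > 0 and c < 0, so the discriminant is positive and exactly one root is positive.
    return (-b + math.sqrt(b * b - 4.0 * a * c)) / (2.0 * a)

for H2 in (8, 12):
    print(H2, round(positive_root(H2), 2))
```

Under this reading, the root is roughly 94.8 at the boundary value $H_2 = 8$ and roughly 139.2 at $H_2 = 12$, so the choice $q = 150$ exceeds the threshold in both cases.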



7. Special cases 

A desirable property of any nonparametric model is that it can "collapse" back to a
simpler structure when the additional flexibility is not warranted. For example, the
nonparametric prior may be centered on a smaller class of densities (e.g., a parametric
family), with a faster rate of convergence obtained when the true density falls within this
smaller class. In this section, we study such collapsing behavior in a couple of cases.

7.1. Properly centering the prior leads to parametric rate of
convergence

We have already noted in the introduction that one can center the non-parametric model
on a parametric family by centering the prior on the transfer function on a parametric
class of quantile (or inverse c.d.f.) functions $\{F_\theta^{-1} : \theta \in \Theta\}$. Here we show that if our
guess about the true density $f_0$ is correct, we can actually achieve a parametric rate of
convergence by centering the prior for $\mu$ on $F_0^{-1}$. Centering the prior on the true quantile
function expands the RKHS to include the best approximation, which is the true quantile
function itself. We formalize our result in the following Theorem 7.1.

Assumption 7.1. Define $\mu_0$ to be $F_0^{-1}$. We assume $\mu$ follows a Gaussian process
$GP(\mu_0, c)$ centered at $\mu_0$ and with a squared exponential covariance kernel $c(\cdot, \cdot; A)$ and a
Gamma prior for the inverse-bandwidth $A$, so that $c(t, s; A) = e^{-A(t-s)^2}$, $t, s \in [0,1]$,
$A \sim Ga(p, q)$.

Assumption 7.2. n - 1 /C logn ~ IG{a, b). 

Theorem 7.1. If $f_0$ is compactly supported and satisfies Assumption 3.1 and the priors $\Pi_\mu$ and
$\Pi_\sigma$ are as in Assumptions 7.1 and 7.2 respectively with the correct centering
$f_0$, the best obtainable posterior rate of convergence relative to $h$ is

$\epsilon_n = n^{-1/2}(\log n)^{t_0}$ (7.1)

for some global constant $t_0$.

Proof. The proof differs from that of Theorem 5.1 only in the
calculation of the prior concentration. Let $\mu = \mu_0 + W$ where $W \sim GP(0, c)$. It is easy
to see that

P{\\ft - MolL < 2e) = PiWWlL < 2e) > e~ c ^ ^ 2 ^P(a <A< a,). 
Hence with a n = n" 1 / 4 , l n = nT 1 ^, h n = rfi for some j3 > 0, we can show that 
P(a E [o- n ,2o- n ]) > expj-Oogn)* 1 } 

for some > 0. 

Define the Gaussian process sieve to be 

B n = fa - 



M n J-^Ul + WBr ) U ( U a<e „ (M„HJ) + S n 



where the sequences £, n ,Sn,M n are exactly as specified in the proof of Theorem 5.1. 
It follows from the proof of Theorem 5.1 that $\epsilon_n = n^{-1/2}(\log n)^t$ for some constant
$t > 0$. □

Remark 7.1. The extension to the case when the true density is actually non-compact
and satisfies the tail condition in Assumption 6.1 can be handled following the steps in
the proof of Theorem 6.1 with $\Pi_\mu$ in Assumption 6.2 centered at the non-compact $f_0$ for
appropriate choices of sequences $a_n$, $b_n$ and $\lambda_j$, and $\Pi_\sigma$ as in Assumption 5.2. However,
one can show that the best obtainable rate of convergence can only be made arbitrarily
close to the parametric rate, in the sense that the rate of convergence would be slower
compared to the parametric rate by a factor $n^\beta$ where $\beta$ can be arbitrarily small.

Remark 7.2. A more interesting and practical extension of Theorem 7.1 is the case
where one has correctly guessed a parametric family which contains the true density.
Suppose the parametric family is given by $\{f_\theta : \theta \in \mathbb{R}^p\}$ indexed by the parameter $\theta$
living in some Euclidean space, and the true density $f_0 = f_{\theta_0}$. It is natural then to center
the prior for $\mu$ on $F_\theta^{-1}$, with a hyperprior on $\theta$ quantifying uncertainty about the value of
the finite-dimensional parameter $\theta$. A straightforward application of Theorem 7.1 shows
that it is possible to attain the parametric rate of convergence under the same assumptions
as in Theorem 7.1 if the prior on $\theta$ has full support on $\mathbb{R}^p$ and the mapping $\theta \mapsto f_\theta$
satisfies mild regularity conditions, e.g., as in Ghosal, Ghosh and van der Vaart (2000).
In particular, one obtains the near parametric rate if $\theta \sim N_p(\mu_0, \Sigma_0)$. If the prior guess
about the parametric family is incorrect, then one would still get the near minimax rate
for the class of twice continuously differentiable densities.

7.2. True density is a Gaussian convolution with a finite mixture 
of truncated Gaussians 

Ghosal and van der Vaart (2001) showed that when the true density is a location-scale or
location mixture of normals $f_0(x) = \int \sigma^{-1}\phi\{(x - \mu)/\sigma\}\,dF_0(\mu, \sigma)$ with the scale parameter lying between
two fixed numbers and the mixing distribution $F_0$ being either compactly supported
or having sub-Gaussian tails, a Dirichlet process mixture of normals can achieve a near-
parametric rate of convergence. To mimic the above super-smooth case for our non-linear
latent variable model, we shall consider a simplistic situation when the true density is
a Gaussian convolution with a finite mixture of truncated Gaussians with the same
truncation bounds. We show below that the rate of convergence in that case can be
as close as possible to the parametric rate. The actual super-smooth case would be the
situation when the true density is a finite mixture of Gaussians, so that it can be expressed
as a convolution with a finite mixture of Gaussians. Remark 7.3 briefly discusses
that case.

Theorem 7.2. Given any a > 0. If fo is fio-o * /i where f\ is a finite mixture of trun- 
cated Gaussians with the same truncation bounds and the prior 11^ is as in Assumptions 
5.1 and w _ 1/(2e ,° F1) lorrn ~ IG(a,b) respectively, then the best obtainable rate of posterior 
convergence is 

$n^{-\alpha/(2\alpha+1)}\log^{t_\alpha}(n)$ (7.2)
where $t_\alpha$ is a constant depending on $\alpha$.

Proof. Clearly $f_1$ is an infinitely smooth density whose quantile function $\mu_1 = F_1^{-1}$ is
infinitely smooth on $[0,1]$, and hence $\mu_1 \in C^\alpha([0,1])$ for any $\alpha > 0$. Observe that



From van der Vaart and van Zanten (2009), we obtain $P(\|\mu - \mu_1\|_\infty < \sigma_n^\alpha) \ge \exp(-\sigma_n^{-1})$
and $P(|\sigma - \sigma_0| < \sigma_n^{\alpha+1}) \ge \exp(-\sigma_n^{-1})$ for $\sigma_n = n^{-1/(2\alpha+1)}$. With the same sieve as in
the proof of Theorem 5.1, it follows that $\epsilon_n = n^{-\alpha/(2\alpha+1)}(\log n)^{t_\alpha}$ for some constant $t_\alpha > 0$. □



Remark 7.3. The extension to the case when the true density is a finite mixture of
Gaussians can be handled following the steps in the proof of Theorem 6.1 with $\Pi_\mu$ and $\Pi_\sigma$
as in Assumptions 6.2 and 5.2 respectively, for appropriate choices of the sequences
$a_n$, $b_n$ and $\lambda_j$.

8. Discussion 

Non-linear latent variable models offer a flexible modeling framework in a broad variety of 
problems and improved practical performance has been demonstrated by Lawrence (2004, 
2005); Lawrence and Moore (2007); Ferris, Fox and Lawrence (2007); Kundu and Dunson 
(2011) among others. The univariate density estimation model studied here can be ex- 
tended to multivariate density estimation, latent factor modeling and density regression 
problems; we are currently studying theoretical properties of these extensions building 
upon the results developed in this article in the baseline case. 

In standard Gaussian process regression, the regression function is assumed to be con- 
tinuous on a compact domain, and one can use standard results on concentration bounds 
for Gaussian processes (van der Vaart and van Zanten, 2008b). However, we cannot use 
these results directly as the quantile function of a density supported on the entire real 
line is unbounded near zero and one. To address this problem, we required assumptions 
on the tails of the true density and exploited the interplay between the tails of a density 
and the boundary behavior of the corresponding quantile function. Building a sequence of 
compact approximations to the true density, accurate concentration bounds around the 
corresponding quantile functions (which are in C[0, 1]) are developed for the Gaussian 
process prior on the transfer function. While deriving this bound, one has to carefully 
calibrate the rate at which the RKHS norms of the sequence of approximating quantile 
functions increase to infinity. A truncated series prior is convenient for this purpose, 
however one needs to appropriately rescale the prior as in Assumption 6.2 for the optimal rate. It would
be interesting to study whether one obtains the same rate for a host of other commonly used
Gaussian process priors. It should be noted here that posterior consistency with commonly
used Gaussian process priors is immediate using our treatment of the non-compact
case.

We finally note that although our results assume twice continuous differentiability
of the true density, one can obtain the optimal rate of convergence for an arbitrary
degree of smoothness of the truth. From the treatment of $\int f_0 \log(f_0/f_{\mu,\sigma})^2$ in (6.8),
it is not difficult to see that all the terms barring $\int f_0 \log(f_0/f_{\mu_0,\sigma})^2$ can be made
$O(\sigma^H)$ for arbitrarily large $H$. The first term $\int f_0 \log(f_0/f_{\mu_0,\sigma})^2$ cannot be improved
beyond $O(\sigma^4)$ even if the true density is more than twice continuously differentiable,
since $h(f_0, \phi_\sigma * f_0)$ can only be $O(\sigma^2)$. This is a well-known issue with Gaussian
convolutions and one can improve the approximation bound by using a higher order
kernel $\psi_\sigma$ (Fan and Hu, 1992; Marron and Wand, 1992), so that $\|f_0 - \psi_\sigma * f_0\| = O(\sigma^H)$
for $H$ arbitrarily large. A thorny issue with using higher-order kernels in the frequentist
literature is that $\psi_\sigma * f_0$ is not guaranteed to be positive everywhere, but one
can bypass that easily in a Bayesian framework as one only needs to show that the
prior support contains densities that are appropriately close to the true density. Letting
$\phi_\sigma * f_1 = \psi_\sigma * f_0$, one can solve for $f_1$ using inverse Fourier transforms and one
has $\|f_0 - \phi_\sigma * f_1\| = O(\sigma^H)$. Although $\int f_1 = 1$, $f_1$ can be negative at some places.
Kruijer, Rousseau and van der Vaart (2010) showed that under suitable conditions on
the true density, one obtains the same approximation error for $f_2$, the positive part of
$f_1$ normalized to integrate to one. Kruijer, Rousseau and van der Vaart (2010) used the
twicing kernel method (Newey, Hsieh and Robins, 2004) to obtain $f_1$ in a closed form;
one can use the same trick here or use other higher-order kernels to obtain $f_1$. One can
then simply replace $\mu_0$ with the quantile function of $f_2$ and proceed with the rest of the
analysis identically.
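To see why a twicing kernel is of higher order, and why positivity is lost, the following sketch builds the twicing kernel from the standard Gaussian kernel, $\psi = 2\phi - \phi * \phi$, using the fact that $\phi * \phi$ is the $N(0, 2)$ density. It then checks numerically that $\psi$ integrates to one, has vanishing second moment, and takes negative values. The function names, the integration range and the grid size are choices made for this illustration only.

```python
import math

def phi(x, var=1.0):
    """Centered Gaussian density with variance `var`."""
    return math.exp(-x * x / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def twicing(x):
    """Twicing kernel psi = 2*phi - phi*phi; phi*phi is the N(0, 2) density."""
    return 2.0 * phi(x) - phi(x, var=2.0)

def integrate(g, lo=-12.0, hi=12.0, n=24000):
    """Composite trapezoidal rule on [lo, hi]."""
    h = (hi - lo) / n
    total = 0.5 * (g(lo) + g(hi)) + sum(g(lo + i * h) for i in range(1, n))
    return total * h

mass = integrate(twicing)
second_moment = integrate(lambda x: x * x * twicing(x))
print(round(mass, 6), round(second_moment, 6), twicing(3.0) < 0.0)
```

The vanishing second moment removes the $O(\sigma^2)$ bias term of the Gaussian convolution, while the negative lobes are the reason the positive part $f_2$ has to be taken and renormalized.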

Appendix 

A.1. Proof of Theorem 4.1

Proof. Let $f_0$ be a density with quantile function $\mu_0$ that satisfies the conditions of
Theorem 4.1. Observe that $\|\mu_0\|_1 = \int_0^1 |\mu_0(t)|\,dt = \int |z| f_0(z)\,dz < \infty$ since $f_0$
has a finite first moment, and thus $\mu_0 \in L_1[0,1]$. Fix $\epsilon > 0$. We want to show that
$\Pi\{B_\epsilon(f_0)\} > 0$, where $B_\epsilon(f_0) = \{f : \|f - f_0\|_1 < \epsilon\}$.

Note that $\mu_0 \notin C[0,1]$, so that $\mathrm{pr}(\|\mu - \mu_0\|_\infty < \epsilon)$ can be zero for small enough $\epsilon$. The
main idea is to find a continuous function $\tilde\mu_0$ close to $\mu_0$ in $L_1$ norm and exploit the fact
that the prior on $\mu$ places positive mass on arbitrary sup-norm neighborhoods of $\tilde\mu_0$. The
details are provided below.

Since $\|\phi_\sigma * f_0 - f_0\|_1 \to 0$ as $\sigma \to 0$, find $\sigma_1$ such that $\|\phi_\sigma * f_0 - f_0\|_1 < \epsilon/2$ for
$\sigma < \sigma_1$. Pick any $\sigma_0 < \sigma_1$. Since $C[0,1]$ is dense in $L_1[0,1]$, for any $\delta > 0$, we can find a
continuous function $\tilde\mu_0$ such that $\|\mu_0 - \tilde\mu_0\|_1 < \delta$. Now, $\|f_{\mu_0,\sigma} - f_{\tilde\mu_0,\sigma}\|_1 \le C\|\mu_0 - \tilde\mu_0\|_1/\sigma$
for a global constant $C$. Thus, for $\delta = \epsilon\sigma_0/4$,

$\{f_{\mu,\sigma} : \sigma_0 < \sigma < \sigma_1, \|\mu - \tilde\mu_0\|_\infty < \delta\} \subset \{f_{\mu,\sigma} : \|f_0 - f_{\mu,\sigma}\|_1 < \epsilon\}$,

since $\|f_0 - f_{\mu,\sigma}\|_1 \le \|f_0 - f_{\mu_0,\sigma}\|_1 + \|f_{\mu_0,\sigma} - f_{\tilde\mu_0,\sigma}\|_1 + \|f_{\tilde\mu_0,\sigma} - f_{\mu,\sigma}\|_1$ and $f_{\mu_0,\sigma} = \phi_\sigma * f_0$.
Thus, $\Pi\{B_\epsilon(f_0)\} \ge \mathrm{pr}(\|\mu - \tilde\mu_0\|_\infty < \delta)\,\mathrm{pr}(\sigma_0 < \sigma < \sigma_1) > 0$, since $\Pi_\mu$ has full sup-norm
support and $\Pi_\sigma$ has full support on $[0, \infty)$. □
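The Lipschitz-type bound $\|f_{\mu_1,\sigma} - f_{\mu_2,\sigma}\|_1 \le C\|\mu_1 - \mu_2\|_1/\sigma$ used in the proof above can be traced to a pointwise fact about shifted Gaussian kernels. The sketch below, with hypothetical function names, compares the exact $L_1$ distance between the $N(a, \sigma^2)$ and $N(b, \sigma^2)$ densities with the linear bound $\sqrt{2/\pi}\,|a - b|/\sigma$; integrating this comparison over $t \in [0,1]$ suggests that $C = \sqrt{2/\pi}$ is one admissible constant, although this specific value is our own calculation and is not stated in the text.

```python
import math

def l1_distance_shifted_normals(a, b, sigma):
    """Exact L1 distance between the N(a, sigma^2) and N(b, sigma^2) densities."""
    return 2.0 * math.erf(abs(a - b) / (2.0 * math.sqrt(2.0) * sigma))

def lipschitz_bound(a, b, sigma):
    """Linear bound sqrt(2/pi) * |a - b| / sigma on the same distance."""
    return math.sqrt(2.0 / math.pi) * abs(a - b) / sigma

for a, b, sigma in [(0.0, 0.01, 0.1), (0.0, 0.5, 0.1), (-1.0, 2.0, 0.5)]:
    exact = l1_distance_shifted_normals(a, b, sigma)
    bound = lipschitz_bound(a, b, sigma)
    print(round(exact, 4), round(bound, 4), exact <= bound)
```

The bound is tight for small shifts relative to $\sigma$ and becomes loose once the exact distance saturates at its maximum value of 2.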



A.2. Proof of Lemma 6.1

Proof. From Theorem 5.2 of Adler (1990) it follows that if $X$ is a centered Gaussian
process on a compact set $T \subset \mathbb{R}^d$ and $\sigma_T^2$ is the maximum variance attained by the
Gaussian process on $T$, then for large $M$,

$P(\|X\|_\infty > M) \le 2N(1/M, T, \|\cdot\|)\exp\left[-\frac{1}{2\sigma_T^2}\{M - v(M)\}^2\right]$, (A.1)

where $v(M) = C_5\int_0^{1/M}\{\log N(1/M, T, \|\cdot\|)\}^{1/2}\,d(1/M)$ for some constant $C_5 > 0$. Observe
that $W^J$ is rescaled to $T = [0, a]$ and the maximum variance attained by $W^J$ is
$b^2\sigma_J^2$. Note that $N(1/M, T, \|\cdot\|) = aM$. Now

$v(M) \le C_6\int_0^{1/M}\{\log(aM)\}^{1/2}\,d(1/M) \le C_6\int_0^{1/M}\{(\log a)^{1/2} + (\log M)^{1/2}\}\,d(1/M) \le \frac{C_6}{M}\{(\log a)^{1/2} + (\log M)^{1/2}\}$

for some constant $C_6 > 0$. Plugging in the value of $N(1/M, T, \|\cdot\|)$ and the bound for
$v(M)$ in (A.1), we get the required bound for $P(\|W^J\|_\infty > M)$. □



A.3. Proof of Lemma 6.2

Proof. From Mead and Papanicolaou (1984), for any $k \ge 1$, we can get an infinitely
smooth density $f_{m_\sigma}$ supported on $[-a_\sigma, a_\sigma]$ such that

$\int_{-a_\sigma}^{a_\sigma} x^j f_{m_\sigma}(x)\,dx = \int_{-a_\sigma}^{a_\sigma} x^j f(x)\,dx$, $j = 0, \ldots, k$. (A.2)

One possible choice of $f_{m_\sigma}$ in (A.2) has the form $f_{m_\sigma}(x) = \exp(-\sum_{j=0}^k \lambda_j x^j)$, which
corresponds to the maximum entropy moment matching (MAXENT) density. We shall choose
$k$ sufficiently large depending on $\sigma$ so that one has the desired approximation result.
Consider an interval $I_\sigma = [-(a_\sigma + t_\sigma), (a_\sigma + t_\sigma)]$ containing the interval $[-a_\sigma, a_\sigma]$ for
some $t_\sigma > 0$ to be chosen later depending on $\sigma$. Observe that

$\sup_{x \notin I_\sigma}|\phi_\sigma * f(x) - \phi_\sigma * f_{m_\sigma}(x)| \le \sup_{x \notin I_\sigma}\int\phi_\sigma(x - y)|f(y) - f_{m_\sigma}(y)|\,dy \le 2\phi_\sigma(t_\sigma)$. (A.3)



Next, along the lines of Ghosal and van der Vaart (2007) 
<t><r{x - y)f(y) - f ma (y)dy 



sup 



< 2 sup 



. 2d 

< sup 



3=0 
2\k 



2tt 



J 



e\x - y [ 
2ka 2 



< C 2 a 



- { 2k+i)feY (2a a +U) 2k 
k k 



for some global constants C\, C 2 > 0. 

Now choose t a = Aa„ for some constant A > 0. Then, from A. 3 and A. 4, we obtain, 



Ha- * f - <At * /roJL < max 



20a (io-), — exp i 2fclog 



(2 + A)a |logcr| 



1/T2 



k log- 



where -B = e/2. Choosing fe = £?er a | log er 



a/r 2 



(A.4) 



fclog- = B<T- Q |lo g( r| Q/r2 {alog(l/a) + (a/r 2 ) log(|loga|)}, 



and 



2fclog (2 + A) ao |loga| 1/T2 = 2B(j _ Q | logCT |^ {log{(2 + A)ao} + ( i /T2 ) log(|loga|)}. 
a 

Clearly, if a > 2 and cr is sufficiently small, 2fclog ( 2 +- 4 )°°^g- T l 1/r2 < fc log -| . Then by 
choosing a > 2, we can make 2<j> a (t a ) > exp |2fclog V+A)a \\o g af/^ _ fclog fc J and 



hence 



Ha * / - <j>a * / m JL < = ^ exp{-(a A) 2 /2 |loga| 2/r2 }. (A.5) 



Since $A$ is arbitrary, the conclusion of the lemma follows. □

A.4. Proof of Lemma 6.3

Proof. Since $m_\sigma^J$ is contained in the RKHS of $W^J$,

$\inf\{\|w\|_{\mathbb{H}}^2 : \|w - m_\sigma^J\|_\infty < \epsilon\} = \|m_\sigma^J\|_{\mathbb{H}}^2 = \sum_{j=0}^J \frac{m_{\sigma j}^2}{\lambda_j^2}$.

Next we calculate $P(\|W^J\|_\infty < \epsilon)$ using a technique similar to the proof of Theorem
4.5 in van der Vaart and van Zanten (2008a). For any numbers $\alpha_j > 0$ with $\sum_{j=0}^J \alpha_j = 1$,
we have

$P(\|W^J\|_\infty \le \epsilon) \ge P\left(\sum_{j=0}^J |Z_j\lambda_j| \le \epsilon/b\right) \ge \prod_{j=0}^J P(|Z_j\lambda_j| \le \alpha_j\epsilon/b)$.

Now, define a function $f : [0, \infty) \to \mathbb{R}$ given by $f(y) = -\log P(|Z| < y) = -\log\{2\Phi(y) - 1\}$,
where $Z \sim N(0, 1)$. $f$ is a decreasing function and, following van der Vaart and van Zanten
(2008a), $f$ is bounded above by a multiple of $1 + |\log y|$ for $y \in [0, c]$ and bounded above
by a multiple of $e^{-y^2/2}$ for $y > c$ for some $c > 0$. Thus, with $\alpha_j = (K + j^2)^{-1}$ for a large
constant $K > 0$,

$-\log P(\|W^J\|_\infty \le \epsilon) \le \sum_{j=1}^J f\left(\frac{\epsilon j^{q/4}}{b(K + j^2)}\right) + f\left(\frac{\epsilon}{Kb}\right) \le \int_1^J f\left(\frac{\epsilon x^{q/4}}{b(K + x^2)}\right)dx + f\left(\frac{\epsilon}{b(K + 1)}\right) + f\left(\frac{\epsilon}{Kb}\right)$,

where the last inequality in the above display follows from the fact that $f$ is decreasing
and the map $x \mapsto x^{q/4}/(K + x^2)$ is non-decreasing on $[1, \infty)$ for any $K > 0$ as long as
$q > 4$. For $\epsilon$ small enough so that $\epsilon/(Kb) < c$, $f(\epsilon/(Kb)) \lesssim 1 + \log(Kb/\epsilon)$.
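The product inequality in the proof of Lemma 6.3 gives a computable upper bound on the small-ball exponent $-\log P(\|W^J\|_\infty \le \epsilon)$. The sketch below evaluates that sum with $f(y) = -\log\{2\Phi(y) - 1\} = -\log\mathrm{erf}(y/\sqrt{2})$, weights $\alpha_j = (K + j^2)^{-1}$ and scales $\lambda_j = j^{-q/4}$, taking $\lambda_0 = 1$ as an assumption. The function names and the numerical values are hypothetical and chosen only for illustration.

```python
import math

def f(y):
    """f(y) = -log P(|Z| < y) = -log erf(y / sqrt(2)) for Z ~ N(0, 1), y > 0."""
    return -math.log(math.erf(y / math.sqrt(2.0)))

def small_ball_exponent(eps, b, J, q, K):
    """Upper bound on -log P(||W^J||_inf <= eps) from the product inequality."""
    total = f(eps / (K * b))                      # j = 0 term, lambda_0 = 1 assumed
    for j in range(1, J + 1):
        # alpha_j * eps / (b * lambda_j) with alpha_j = 1/(K + j^2), lambda_j = j^(-q/4)
        total += f(eps * j ** (q / 4.0) / (b * (K + j * j)))
    return total

b, J, q, K = 0.5, 10, 20, 10.0                    # illustrative values only
for eps in (0.2, 0.1, 0.05):
    print(eps, round(small_ball_exponent(eps, b, J, q, K), 3))
```

Only the first few terms contribute noticeably when $q$ is large, because the argument of $f$ grows like $j^{q/4 - 2}$ and $f$ decays rapidly; this is the mechanism behind the case analysis that follows.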

Now consider two cases to bound the integral in the last display. If $\epsilon J^{q/4}/b \le (K + J^2)$,
then $\epsilon x^{q/4}/\{b(K + x^2)\} \le 1$ for $x \in [1, J]$. Hence in that case,



loe 



b{K + x 2 ) 



dx (A.6) 



< 1 + log^ i da;< J 1+log-i -. (A.7) 
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On the other hand, if eJ q / 4 /b > (K + J 2 ), 



'U+?J*=UJ x ^ x t ^ifTWiJ^ 1 *- (A - 8) 

The integral above is bounded by 

? fcA-rfar^ 1 * (A ' 9) 

6/(ze 



rb/ [ze ) /-oo / \ 

<z 4/q j o f(x) x ^- 1 dx + J o f[ K + yl6/q )y 4/q ~ ld y (a.io) 

for z = (A' + (fe/e) 16 / 9 ). The first integral is bounded as e ! and the second integral is 
finite for q > 16. Hence for q > 16, 

, , w 1 || jna ^/^(l + logf),eJ 9 / 4 ^6J 2 

□ 
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